PyTorch 2 internals
A not so short guide to recent PyTorch innovations
Christian S. Perone
(christian.perone@gmail.com)
http://blog.christianperone.com
London, UK, Dec 2023
Who Am I
▸ Christian S. Perone
▸ ML Research Engineer in London/UK
▸ Blog at
▸ blog.christianperone.com
▸ Open-source projects at
▸ https://github.com/perone
▸ Twitter @tarantulae
Disclaimer
PyTorch development pace is so fast that no man ever
steps in PyTorch code twice, for it’s not the same code
and he’s not the same man.
—Heraclitus, 500 BC
Section I
Tensors
Tensors
Simply put, tensors are a generalization of vectors and matrices. In
PyTorch, they are a multi-dimensional matrix containing elements of
a single data type.
>>> import torch
>>> t = torch.tensor([[1., -1.], [1., -1.]])
>>> t
tensor([[ 1., -1.],
        [ 1., -1.]])
>>> t.dtype # They have a type
torch.float32
>>> t.shape # a shape
torch.Size([2, 2])
>>> t.device # and live in some device
device(type='cpu')
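To make the three attributes above concrete, here is a small sketch
(the "cuda" line assumes a GPU is available, otherwise keep everything
on the CPU):
>>> t = torch.zeros(2, 2, dtype=torch.float16)  # pick a dtype at creation
>>> t.dtype
torch.float16
>>> t = t.to(torch.float32)   # convert the dtype
>>> t = t.to("cuda")          # move to another device (needs a GPU)
>>> t.device
device(type='cuda', index=0)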
Tensors
▸ Although PyTorch has an elegant Python-first design, all of PyTorch's
heavy lifting is actually implemented in C++.
▸ In Python, the integration of C++ code is (usually) done using
what is called an extension;
▸ PyTorch uses ATen, which is the foundational tensor operation
library on which all else is built;
▸ To do automatic differentiation, PyTorch uses Autograd, which is
an augmentation on top of the ATen framework;
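A minimal sketch of Autograd augmenting ATen operations: every op on a
tensor that requires gradients records a node (grad_fn) that backward()
later walks:
>>> x = torch.ones(2, 2, requires_grad=True)
>>> y = (x * 3).sum()     # the ATen mul and sum ops are recorded
>>> y.grad_fn
<SumBackward0 object at 0x...>
>>> y.backward()          # walks the recorded graph backwards
>>> x.grad
tensor([[3., 3.],
        [3., 3.]])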
Quick recap Python objects
typedef struct {
PyObject_HEAD
double ob_fval;
} PyFloatObject;
typedef struct _object {
Py_ssize_t ob_refcnt;
struct _typeobject *ob_type;
} PyObject;
(figure: a PyFloatObject embeds PyObject_HEAD, i.e. the ob_refcnt and
ob_type fields of PyObject, followed by its own double ob_fval field)
Quick recap Python objects
struct THPVariable {
PyObject_HEAD;
c10::MaybeOwned<at::Tensor> cdata;
PyObject* backward_hooks = nullptr;
PyObject* post_accumulate_grad_hooks = nullptr;
};
(figure: two Python names, variable_a and variable_b, both pointing to
the same THPVariable object; its PyObject_HEAD ref count goes from 1 to 2)
The TH prefix is from TorcH, and P means Python.
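A small sketch of the reference counting in the diagram, seen from the
Python side (the exact counts can vary, and getrefcount itself adds a
temporary reference):
>>> import sys, torch
>>> variable_a = torch.ones(2, 2)
>>> sys.getrefcount(variable_a)   # 1 reference + 1 temporary from the call
2
>>> variable_b = variable_a       # same THPVariable, one more reference
>>> sys.getrefcount(variable_a)
3
>>> variable_a is variable_b
True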
In Python, everything is an object
>>> a = 300
>>> b = 300
>>> a is b
False
>>> a = 200
>>> b = 200
>>> a is b
True
(figure: for 300, the names a and b point to two separate int objects,
each with ref count 1; for 200, both names point to the same cached int
object, whose ref count is 2)
A typical Python program spends much of its time allocating and
deallocating integers, so CPython caches the small ones (by default,
those in the range -5 to 256).
Zero-copying tensors
It is very common to load data as numpy arrays and convert them to
PyTorch tensors, or vice-versa;
>>> np_array = np.ones((2,2))
>>> np_array
array([[1., 1.],
[1., 1.]])
>>> torch_array = torch.tensor(np_array)
>>> torch_array
tensor([[1., 1.],
[1., 1.]], dtype=torch.float64)
>>> torch_array.add_(1.0)
>>> np_array
array([[1., 1.], # array is intact, a copy was made
[1., 1.]])
A trailing underscore on an operation (e.g. add_) means an in-place operation.
Zero-copying tensors
▸ Now imagine that you have a batch of 128 images, 3 channels each
(RGB) and with size of 224x224;
(figure: the batch laid out as a 3-D grid of values indexed by channel,
row and column)
▸ This will yield a size in memory of ∼ 74MB. We don’t want to
duplicate memory (except when copying them to discrete GPUs
of course);
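A quick sanity check of that number, assuming float32 (4 bytes per element):
>>> batch = torch.ones(128, 3, 224, 224)          # float32 by default
>>> batch.element_size() * batch.nelement() / 2**20
73.5
>>> # 128 * 3 * 224 * 224 * 4 bytes ~= 73.5 MiB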
Zero-copying tensors
Let's now look at a slightly different example, this time using the
torch.from_numpy() function:
>>> np_array
array([[1., 1.],
[1., 1.]])
>>> torch_array = torch.from_numpy(np_array)
>>> torch_array.add_(1.0)
>>> np_array
array([[2., 2.],
[2., 2.]])
The original numpy array was changed, because it used a zero-copy
operation.
Zero-copying tensors
The difference between in-place and standard operations might not be so
clear in some cases:
>>> np_array
array([[1., 1.],
[1., 1.]])
>>> torch_array = torch.from_numpy(np_array)
>>> np_array = np_array + 1.0
>>> torch_array
tensor([[1., 1.],
[1., 1.]], dtype=torch.float64)
However, if you use np_array += 1.0 , that is an in-place operation
that will change torch_array memory.
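A short continuation of the example above, showing the in-place variant
mutating the shared buffer:
>>> np_array = np.ones((2, 2))
>>> torch_array = torch.from_numpy(np_array)
>>> np_array += 1.0          # in-place: writes into the shared buffer
>>> torch_array
tensor([[2., 2.],
        [2., 2.]], dtype=torch.float64)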
Zero-copying tensors
at::Tensor tensor_from_numpy(PyObject* obj, (omitted)) {
// some parts omitted for brevity
auto array = (PyArrayObject*)obj;
int ndim = PyArray_NDIM(array);
auto sizes = to_aten_shape(ndim, PyArray_DIMS(array));
auto strides = to_aten_shape(ndim, PyArray_STRIDES(array));
void* data_ptr = PyArray_DATA(array);
Py_INCREF(obj);
return at::lift_fresh(at::from_blob(
data_ptr, sizes, strides,
[obj](void* data) {
pybind11::gil_scoped_acquire gil;
Py_DECREF(obj);
},
at::device(kCPU).dtype(numpy_dtype_to_aten(PyArray_TYPE(array)))));
}
Pay attention to the reference counting using Py_INCREF() and the
call to at::from_blob() function.
Data pointers
(figure: a PyArrayObject and a FloatTensor, each with its own object
fields, whose data_pointer* fields point to the same raw data)
The FloatTensor copied the numpy array's data pointer, not its
contents. The reference is kept safe by the Python reference-counting
mechanism.
Tensor Storage
The abstraction responsible for holding the data isn’t actually the
Tensor , but the Storage .
struct C10_API StorageImpl : public c10::intrusive_ptr_target {
// (...)
private:
// (...)
DataPtr data_ptr_;
SymInt size_bytes_;
Allocator* allocator_;
// (...)
}
▸ Holds a pointer to the raw data and contains information such as
the size and allocator;
▸ Storage is a dumb abstraction, there is no metadata telling us how
to interpret the data it holds;
Tensor Storage
▸ The Storage abstraction is very powerful because it decouples
the raw data and how we can interpret it;
▸ We can have multiple tensors sharing the same storage, but with
different interpretations, also called views, but without
duplicating memory:
>>> x = torch.ones((2, 2))
>>> x_view = x.view(4)
>>> x_data = x.untyped_storage().data_ptr()
>>> x_view_data = x_view.untyped_storage().data_ptr()
>>> x_data == x_view_data
True
▸ x_view is a different view (interpretation) of the same data
present in the underlying storage that is shared between both
tensors.
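Since the storage is shared, writing through one view is visible through
the other, as in this small sketch:
>>> x = torch.ones(2, 2)
>>> x_view = x.view(4)
>>> x_view[0] = 42.0       # writes into the shared storage
>>> x
tensor([[42.,  1.],
        [ 1.,  1.]])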
Memory allocators (CPU/GPU)
▸ The tensor storage can be allocated either in CPU memory or on the
GPU, so a mechanism is required to switch between these different
allocations:
struct C10_API Allocator {
virtual ~Allocator() = default;
virtual DataPtr allocate(size_t n) const = 0;
virtual DeleterFnPtr raw_deleter() const {...}
void* raw_allocate(size_t n) {...}
void raw_deallocate(void* ptr) {...}
};
▸ There are Allocators that use GPU allocation functions such as
cudaMalloc() when the storage is meant for the GPU, or POSIX
functions such as posix_memalign() when the data lives in CPU
memory.
CUDA caching allocator
PyTorch uses a CUDA caching allocator that maintains a cache of
allocations with the Block structure:
struct Block {
int device; // gpu
cudaStream_t stream; // allocation stream
size_t size; // block size in bytes
BlockPool* pool{nullptr}; // owning memory pool
void* ptr{nullptr}; // memory address
bool allocated{false}; // in-use flag
Block* prev{nullptr}; // prev block if split from a larger allocation
Block* next{nullptr}; // next block if split from a larger allocation
// (...)
}
Calling torch.cuda.empty_cache() will release all unused cached blocks.
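A small sketch of the caching behaviour, assuming a CUDA device is
available (the exact numbers depend on allocation rounding):
>>> a = torch.ones(1024, 1024, device="cuda")   # ~4 MB allocation
>>> torch.cuda.memory_allocated()               # bytes used by tensors
4194304
>>> del a
>>> torch.cuda.memory_allocated()               # the tensor is gone...
0
>>> torch.cuda.memory_reserved() > 0            # ...but the block stays cached
True
>>> torch.cuda.empty_cache()                    # release unused cached blocks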
The big picture
(figure: the Tensor object holds a Storage *storage; the Storage holds a
DataPtr data_ptr and an Allocator *allocator; the Allocator exposes
raw_allocate() / raw_deallocate() and manages the raw data)
▸ The Tensor has a Storage which in turn has a pointer to the
raw data and to the Allocator to allocate memory according
to the destination device.
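From Python you can peek at these pieces; a small sketch using the
untyped storage seen earlier:
>>> x = torch.ones(2, 2)
>>> storage = x.untyped_storage()
>>> storage.nbytes()          # 4 float32 elements * 4 bytes
16
>>> hex(storage.data_ptr())   # address of the raw data held by the storage
'0x...'
>>> storage.device
device(type='cpu')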
Section II
JIT
JIT - Just-in-time compiler
▸ PyTorch is eager by design, which means that it is easily hackable
to debug, inspect, etc;
▸ However, this poses problems for optimization and for
decoupling it from Python (the model itself is Python code);
▸ PyTorch 1.0 introduced torch.jit , which has two main
methods to convert a PyTorch model to a serializable and
optimizable format;
▸ TorchScript was also introduced as a statically-typed subset of
Python;
JIT - Just-in-time compiler
Two very different worlds, each with its own requirements:
▸ EAGER MODE: prototype, debug, train, experiment.
▸ SCRIPT MODE: optimization, other languages, deployment.
Tracing and scripting are the two paths that take a model from eager
mode to script mode.
Tracing
def my_function(x):
if x.mean() > 1.0:
r = torch.tensor(1.0)
else:
r = torch.tensor(2.0)
return r
>>> ftrace = torch.jit.trace(my_function, (torch.ones(2, 2)))
>>> ftrace.graph
graph(%x : Float(2, 2, strides=[2, 1], requires_grad=0, device=cpu)):
%5 : Float(requires_grad=0, device=cpu) = prim::Constant[value={2}]()
%6 : Device = prim::Constant[value="cpu"]()
%7 : int = prim::Constant[value=6]()
%8 : bool = prim::Constant[value=0]()
%9 : bool = prim::Constant[value=0]()
%10 : NoneType = prim::Constant()
%11 : Float(requires_grad=0, device=cpu) = aten::to(%5, %6, %7, %8, %9,
%12 : Float(requires_grad=0, device=cpu) = aten::detach(%11)
return (%12)
Tracing
To call the JIT’ed function, just call the forward() method:
>>> x = torch.ones(2, 2)
>>> ftrace.forward(x)
tensor(2.)
However, tracing will not record any control flow such as if statements
or loops; it executes the code with the given inputs and records the
resulting graph. You can see this limitation below:
>>> x = torch.ones(2, 2).add_(1.0)
>>> ftrace.forward(x)
tensor(2.)
According to my_function(), the result should have been 1.0. Tracing
also checks for differences between the traced and the Python function,
but what about Dropout?
Scripting
Another alternative is to use scripting, where you can use decorators
such as @torch.jit.script :
@torch.jit.script
def my_function(x):
if bool(x.mean() > 1.0):
r = 1
else:
r = 2
return r
Scripting
>>> my_function.graph
graph(%x.1 : Tensor):
%2 : NoneType = prim::Constant()
%4 : float = prim::Constant[value=1.]()
%9 : int = prim::Constant[value=1]()
%10 : int = prim::Constant[value=2]()
%3 : Tensor = aten::mean(%x.1, %2)
%5 : Tensor = aten::gt(%3, %4)
%7 : bool = aten::Bool(%5)
%r : int = prim::If(%7)
block0():
-> (%9)
block1():
-> (%10)
return (%r)
Scripting
my_function() is now a torch.jit.ScriptFunction:
>>> type(my_function)
torch.jit.ScriptFunction
When we check the results again:
>>> x = torch.ones(2, 2)
>>> my_function(x)
2
>>> x = torch.ones(2, 2).add_(1.0)
>>> my_function(x)
1
Control-flow logic was preserved!
Why TorchScript?
▸ The concept of having a well-defined Intermediate
Representation (IR) is very powerful, it’s the main concept
behind LLVM platform as well;
▸ This opens the door to:
▸ Decouple the model (computational graph) from the Python runtime;
▸ Use it in production with C++ (no GIL) or other languages;
▸ Capitalize on optimizations (whole program);
▸ Split the development world of hackable and easy to debug from
the world of putting these models in production and optimize
them.
Building the IR
To build the IR, PyTorch leverages the Python Abstract Syntax Tree
(AST), which is a tree representation of the syntactic structure of the
source code.
>>> import ast, astpretty
>>> ast_mod = ast.parse("print(1 + 2)")
>>> astpretty.pprint(ast_mod.body[0], show_offsets=False)
Expr(
value=Call(
func=Name(id='print', ctx=Load()),
args=[
BinOp(
left=Num(n=1),
op=Add(),
right=Num(n=2),
),
],
keywords=[],
),
)
Building the IR
(figure: the AST of the expression print(1 + 2))
PyTorch JIT Phases
(figure: Parsing → Checking → Optimization → Translation → Execution,
starting either from the AST or from source code)
Optimizations
Many optimizations can be used on the computational graph of the
model, such as Loop Unrolling:
Before unrolling:
    for i .. (i += 1):
        for j ..:
            code(i, j)
After unrolling by 4 (plus a remainder loop for leftover iterations):
    for i .. (i += 4):
        for j ..:
            code(i, j)
            code(i+1, j)
            code(i+2, j)
            code(i+3, j)
Optimizations
Also Peephole optimizations such as:
x.t().t() → x
Example:
def dumb_function(x):
return x.t().t()
>>> traced_fn = torch.jit.trace(dumb_function,
... torch.ones(2,2))
>>> traced_fn.graph_for(torch.ones(2,2))
graph(%x : Tensor):
return (%x)
Other optimizations include Constant Propagation, Dead Code
Elimination (DCE), fusion, inlining, etc.
Serialization
>>> from torchvision import models
>>> resnet = torch.jit.trace(models.resnet18(),
... torch.rand(1, 3, 224, 224))
>>> resnet.save("resnet.pt")
$ file resnet.pt
resnet.pt: Zip archive data
$ unzip resnet.pt
Archive: resnet.pt
extracting: resnet/version
extracting: resnet/code/__torch__/torchvision/models/resnet
extracting: resnet/data/0
(...)
Serialization
code/resnet.py
def forward(self: (...) resnet.ResNet,
x: Tensor) -> Tensor:
# (...)
_0 = (bn1).forward((conv1).forward(x, ), )
_1 = (maxpool).forward((relu).forward(_0, ), )
_2 = (layer2).forward((layer1).forward(_1, ), )
_3 = (layer4).forward((layer3).forward(_2, ), )
input = torch.flatten((avgpool).forward(_3, ), 1)
return (fc).forward(input, )
Using the model in C++
In the example below we load the exported TorchScript model and run
the forward() using Torch’s C++ API.
Example of loading a traced model in PyTorch C++ API:
#include <torch/script.h>
int main(int argc, const char* argv[])
{
auto module = torch::jit::load("resnet.pt");
std::vector<torch::jit::IValue> inputs;
inputs.push_back(torch::ones({1, 3, 224, 224}));
at::Tensor output = module.forward(inputs).toTensor();
}
Executing
Just like the Python interpreter executes your code, PyTorch has an
interpreter that executes the IR instructions:
bool runImpl(Stack& stack) {
// (...) omitted
try {
while (true) {
Frame& frame = frames.back();
Instruction inst = INST_FETCH(0);
switch (inst.op) {
case INST(ENTER): {
INST_GUARD;
const auto& obj = peek(stack, 0, 1);
TORCH_INTERNAL_ASSERT(obj.isObject());
entered_objects.push_back(obj);
}
INST_NEXT;
// (...) omitted
Section III
Dynamo
Python Stack Frames
Conceptually, an interpreter
executes instructions within
a context, which we refer to
as frames.
A function call generates a
new frame, which is cleared
when the function returns.
This process is facilitated by
a stack, with the frames
being placed in order, thus
giving rise to the term stack
frames.
(figure: a global frame plus one stack frame per call to add(a, b) and
sub(a, b); each frame holds its own a, b and return value)
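A tiny sketch of those frames from within Python itself, using the
inspect module:
import inspect

def add(a, b):
    # walk the frame stack from the innermost frame outwards
    frame = inspect.currentframe()
    while frame is not None:
        print(frame.f_code.co_name)
        frame = frame.f_back
    return a + b

def outer():
    return add(1, 2)

outer()   # prints: add, outer, <module>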
CPython Frame Evaluation
Frame evaluation in CPython happens in the _PyEval_EvalFrameDefault
function. This is where the core of Python execution lives: all bytecode
gets executed here, and this function is heavily optimized:
for (;;) {
opcode = next_uop->opcode;
oparg = next_uop->oparg;
// (...)
case UNARY_NOT: {
PyObject *value;
PyObject *res;
value = stack_pointer[-1];
assert(PyBool_Check(value));
res = Py_IsFalse(value) ? Py_True : Py_False;
stack_pointer[-1] = res;
break;
}
}
TorchDynamo
▸ TorchScript can be limiting in some situations. TorchDynamo
can overcome some of the limitations while still allowing
unmodified Python code to be compiled;
▸ TorchDynamo was introduced as a way to acquire graphs. It uses a
feature introduced in CPython 3.6 (PEP 523) that exposed the frame
evaluation API, allowing a per-interpreter function pointer to be
specified to handle the evaluation of frames;
TorchDynamo
void enable_eval_frame_shim(PyThreadState* tstate) {
#if PY_VERSION_HEX >= 0x03090000
if (_PyInterpreterState_GetEvalFrameFunc(tstate->interp) !=
&custom_eval_frame_shim) {
DEBUG_CHECK(previous_eval_frame == NULL);
previous_eval_frame =
_PyInterpreterState_GetEvalFrameFunc(tstate->interp);
_PyInterpreterState_SetEvalFrameFunc(tstate->interp,
&custom_eval_frame_shim);
}
#else
if (tstate->interp->eval_frame != &custom_eval_frame_shim) {
// First call
tstate->interp->eval_frame = &custom_eval_frame_shim;
}
#endif
}
TorchDynamo
(figure: TorchDynamo behavior; diagram credit: Jason Ansel)
TorchDynamo
▸ TorchDynamo can switch back to the default Python frame
evaluation when it is not able to capture the graph, creating what
is called a graph break;
▸ A graph break can happen for many reasons, such as calling external
libraries (e.g. numpy) or converting tensors to Python types
(e.g. Tensor.tolist(), Tensor.item(), etc.);
▸ You can inspect the reason for each graph break, and each one
obviously carries the performance penalty of switching back and forth
between compiled code and Python code;
▸ TorchDynamo is used by torch.compile() but it is also exposed in
the torch._dynamo module.
TorchDynamo
from typing import List

import torch

def my_fn(x):
x = x * 2
x = x.tolist()
x += [1, 2]
return x
def custom_backend(gm: torch.fx.GraphModule,
example_inputs: List[torch.Tensor]):
gm.graph.print_tabular()
return gm.forward
opt_my_fn = torch.compile(my_fn, backend=custom_backend)
ret = opt_my_fn(torch.tensor([1., 2.]))
Note that we are explicitly calling Tensor.tolist(), which forces
Torch to convert the tensor into a Python list object.
TorchDynamo
Our custom_backend was called just once with the following
captured graph:
opcode name target args kwargs
------------- ------ ----------------------- --------- --------
placeholder l_x_ L_x_ () {}
call_function mul <built-in function mul> (l_x_, 2) {}
output output output ((mul,),) {}
This graph captures only the x = x * 2 part of the code, because of
the graph break introduced due to the Tensor.tolist() operation.
TorchDynamo then delegates the execution of x += [1, 2] back to
Python’s default frame evaluation.
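One way to list graph breaks and their reasons is torch._dynamo.explain;
a hedged sketch below (the exact call signature and attribute names have
changed between releases):
import torch
import torch._dynamo as dynamo

def my_fn(x):
    x = x * 2
    x = x.tolist()   # forces a graph break
    x += [1, 2]
    return x

# recent releases: explain(fn) returns a callable taking the example inputs
explanation = dynamo.explain(my_fn)(torch.tensor([1., 2.]))
print(explanation.graph_break_count)
print(explanation.break_reasons)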
TorchDynamo
What happens if we modify our my_fn function to go back to a torch
tensor and do a torch operation again?
def my_fn(x):
x = x * 2
# To Python list
x = x.tolist()
x += [1, 2]
# To torch tensor
x = torch.tensor(x)
x = x**2
return x
TorchDynamo
opcode name target args
------------- ------ ------------------------ ---------
placeholder l_x_ L_x_ ()
call_function mul <built-in function mul> (l_x_, 2)
output output output ((mul,),)
opcode name target args
------------- ------ ------------------------- -------------------
call_function tensor <built-in method tensor> ([2.0, 4.0, 1, 2],)
call_function pow_1 <built-in function pow> (tensor, 2)
output output output ((pow_1,),)
Note that our custom_backend was called twice with different
graphs representing the first part of computation and the second part
of the computation, without the pure-Python operations on the
Python list .
TorchDynamo
▸ So far, we haven't actually compiled any of the graphs that our
custom_backend backend received; we have been focusing only on
the graph acquisition problem.
▸ To get performance improvements, we need to equip
torch.compile() with a compiler that will convert the
acquired graphs into efficient native code for different target
hardware such as NVIDIA GPUs, Arm CPUs, RISC-V CPUs,
TPUs, exotic edge devices such as your smart toaster, among
others.
That’s where TorchInductor comes into play.
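In practice you rarely pass a custom backend: torch.compile() uses
TorchInductor by default, as in this minimal sketch:
import torch

def my_fn(x):
    return torch.nn.functional.softmax(x * 2, -1)

compiled = torch.compile(my_fn)       # backend="inductor" is the default
out = compiled(torch.randn(10, 10))   # the first call triggers compilation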
Section IV
Inductor
AOTAutograd
▸ TorchDynamo generates Torch IR, which is a high-level representation
that is not suitable for many different compiler backends;
▸ If we want to speed up training as well, we need to capture the
backward pass too, hence AOTAutograd, where AOT stands for
ahead-of-time;
▸ The AOTAutograd will generate ATen/Prims IR from tracing the
forward and backward graph ahead of time;
▸ IRs in PyTorch are a complex subject with many levels and many
decompositions available;
▸ We will see an example of the difference between the graph
generated by TorchDynamo vs the graph generated by
AOTAutograd.
The big picture
Slide from “Deep Dive into TorchInductor and PT2 Backend Integration". Sherlock Huang et al.
Dynamo Torch IR
Let's take a look at the IR generated by TorchDynamo for the
following model:
import torch
from torch import nn

class MLP(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(8, 10)
def forward(self, x):
x = self.fc1(x)
x = torch.nn.functional.softmax(x, -1)
return x
Dynamo Torch IR
Let’s use the print_readable() method to show the graph this
time:
def custom_backend(gm: torch.fx.GraphModule,
example_inputs: list[torch.Tensor]):
gm.print_readable()
return gm.forward
model = MLP()
my_fn_opt = torch.compile(model,
backend=custom_backend)
input_tensor = torch.randn(10, 8)
ret = my_fn_opt(input_tensor)
Dynamo Torch IR
This will yield the following IR:
class GraphModule(torch.nn.Module):
def forward(self, L_x_ : torch.Tensor):
l_x_ = L_x_
# code: x = self.fc1(x)
l__self___fc1 = self.L__self___fc1(l_x_);
l_x_ = None
# code: x = torch.nn.functional.softmax(x, -1)
softmax = torch.nn.functional.softmax(l__self___fc1, -1);
l__self___fc1 = None
return (softmax,)
AOTAutograd ATen IR
Let’s now change the backend a bit to use AOTAutograd:
from torch._functorch.aot_autograd import aot_module_simplified
def custom_backend(gm: torch.fx.GraphModule,
example_inputs: list[torch.Tensor]):
def my_compiler(gm, example_inputs):
gm.print_readable()
return gm.forward
return aot_module_simplified(
gm,
example_inputs,
fw_compiler=my_compiler
)
AOTAutograd ATen IR
And here we are with the AOTAutograd generated IR (with = None ’s
and some comments removed for brevity):
class GraphModule(torch.nn.Module):
def forward(self, primals_1: f32[10, 8],
primals_2: f32[10], primals_3: f32[10, 8]):
# code: x = self.fc1(x)
t: f32[8, 10] = torch.ops.aten.t.default(primals_1)
addmm: f32[10, 10] = torch.ops.aten.addmm.default(primals_2, primals_3, t)
# code: x = torch.nn.functional.softmax(x, -1)
_softmax: f32[10, 10] = torch.ops.aten._softmax.default(addmm, -1, False)
return [_softmax, primals_3, _softmax]
TorchInductor
Inductor takes the graph produced by AOTAutograd (consisting of
ATen/Prims IR) and performs further graph decompositions:
def forward(self, arg0_1: f32[10, 8], arg1_1: f32[10],
arg2_1: f32[10, 8]):
# code: x = self.fc1(x)
permute: f32[8, 10] = torch.ops.aten.permute.default(arg0_1, [1, 0])
addmm: f32[10, 10] = torch.ops.aten.addmm.default(arg1_1, arg2_1, permute);
# code: x = torch.nn.functional.softmax(x, -1)
amax: f32[10, 1] = torch.ops.aten.amax.default(addmm, [-1], True)
sub: f32[10, 10] = torch.ops.aten.sub.Tensor(addmm, amax)
exp: f32[10, 10] = torch.ops.aten.exp.default(sub)
sum_1: f32[10, 1] = torch.ops.aten.sum.dim_IntList(exp, [-1], True)
div: f32[10, 10] = torch.ops.aten.div.Tensor(exp, sum_1)
return (div,)
TorchInductor
▸ After that, the graph goes to the scheduling phase where fusion
can happen and then to the appropriate TorchInductor backend;
▸ TorchInductor can generate C++/OpenMP code or Triton. The
generated kernels are then called by a generated wrapper;
▸ Industry is collaborating on backend optimizations (e.g. Intel's
speedups for CPU bfloat16 on some recent processors);
▸ We will now look at part of a C++ kernel generated by
TorchInductor for the fused softmax with CPU tensors (on macOS, as
an example).
TorchInductor
extern "C" void kernel(float* in_out_ptr0,
float* out_ptr0, float* out_ptr1)
{
auto in_ptr0 = in_out_ptr0;
{
#pragma GCC ivdep
for(long i0=static_cast<long>(0L); i0<static_cast<long>(10L);
i0+=static_cast<long>(1L))
{
float tmp_acc0 = -std::numeric_limits<float>::infinity();
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L);
i1+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
}
out_ptr0[static_cast<long>(i0)] = tmp_acc0;
}
}
TorchInductor
Now, if we run the same code with CUDA tensors, we get the Triton
kernel below:
@triton.jit
def triton_(in_ptr0, out_ptr2, xnumel, rnumel, XBLOCK : tl.constexpr):
# ... (omitted for brevity)
tmp0 = tl.load(in_ptr0 + (r1 + (10*x0)), rmask & xmask, other=0)
tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK])
tmp3 = tl.where(rmask & xmask, tmp1, float("-inf"))
tmp4 = triton_helpers.max2(tmp3, 1)[:, None]
tmp5 = tmp0 - tmp4
tmp6 = tl.exp(tmp5)
tmp7 = tl.broadcast_to(tmp6, [XBLOCK, RBLOCK])
tmp9 = tl.where(rmask & xmask, tmp7, 0)
tmp10 = tl.sum(tmp9, 1)[:, None]
tmp11 = tmp6 / tmp10
tl.store(out_ptr2 + (r1 + (10*x0)), tmp11, rmask & xmask)
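If you want to see the code Inductor generates for your own model, one
option (a sketch; the output directory layout can differ between
versions) is to set TORCH_COMPILE_DEBUG before compiling:
import os
os.environ["TORCH_COMPILE_DEBUG"] = "1"   # must be set before Inductor runs

import torch

def fused_softmax(x):
    return torch.nn.functional.softmax(x * 2, -1)

compiled = torch.compile(fused_softmax)
compiled(torch.randn(10, 10))
# the generated C++/Triton kernels and wrapper code are dumped under a
# torch_compile_debug/ directory in the current working directory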
Section V
Torch Export
Torch Export Path
▸ Torch Export ( torch.export ) was created to do whole-graph
capture;
▸ As we discussed earlier, TorchDynamo can create graph breaks
and do this back-and-forth with the Python interpreter;
▸ This cooperative dynamic with Python makes it difficult to be
able to embed it in environments without the Python runtime;
▸ torch.export relies on the torch.compile stack, but with important
differences: it doesn't fall back to the Python interpreter, so the
captured graph cannot have graph breaks, and code changes may be
required;
▸ The main goal of torch.export is to provide a normalized IR, using
the Core ATen IR opset, that can be loaded and executed in different
languages/environments.
Dynamo Torch IR
Let’s use the same code we used earlier with TorchDynamo and export
it with torch.export :
class MLP(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(8, 10)
def forward(self, x):
x = self.fc1(x)
x = torch.nn.functional.softmax(x, -1)
return x
Torch Export
>>> import torch.export as export
>>> model = MLP()
>>> sample = torch.randn(10, 8)
>>> exp = export.export(model, (sample,))
>>> exp
<torch.export.ExportedProgram object at 0x163c8ad10>
>>> print(exp)
class GraphModule(torch.nn.Module):
def forward(self, arg0_1: f32[10, 8], arg1_1: f32[10], arg2_1: f32[10, 8]):
permute: f32[8, 10] = torch.ops.aten.permute.default(arg0_1, [1, 0])
addmm: f32[10, 10] = torch.ops.aten.addmm.default(arg1_1, arg2_1, permute)
_softmax: f32[10, 10] = torch.ops.aten._softmax.default(addmm, -1, False)
return (_softmax,)
(...)
Torch Export
Let’s serialize the exported graph:
>>> export.save(exp, "serialized_graph.pt2")
We can see that the format is a zip archive:
$ file serialized_graph.pt2
serialized_graph.pt2: Zip archive data
... and we can extract it to inspect the contents:
$ unzip serialized_graph.pt2
extracting: serialized_exported_program.json
extracting: serialized_state_dict.json
extracting: version
Torch Export
There is a version file:
$ cat version
1
A serialized_exported_program.json :
$ file serialized_exported_program.json
serialized_exported_program.json: JSON data
And the serialized_state_dict.json :
$ file serialized_state_dict.json
serialized_state_dict.json: Zip archive data
Not sure why PyTorch uses a json extension for a Zip archive.
Torch Export
$ jq "keys" serialized_exported_program.json
["equality_constraints",
"example_inputs",
"graph_module",
"opset_version",
"range_constraints",
"schema_version"]
The graph is in graph_module and there is an opset_version field
with the ATen IR opset version that was used:
$ jq .opset_version serialized_exported_program.json
{
"aten": 10
}
Torch Export
Let’s see the nodes from the graph:
$ jq ".graph_module.graph.nodes[].target" (...)
"torch.ops.aten.permute.default"
"torch.ops.aten.addmm.default"
"torch.ops.aten._softmax.default"
Let’s see the outputs of the graph:
$ jq .graph_module.graph.outputs (...)
[{
"as_none": null,
"as_tensor": {
"name": "_softmax"
},
"as_tensors": null,
"as_int": null,
"as_ints": null,
"..."
}]
Torch Export
▸ You might need to rewrite your code if you use torch.export ,
especially if you have graph breaks or data/shape-dependent control
flow;
▸ torch.export is, nevertheless, a very nice direction towards
standardization of the IR. If vendors adopt it, you can skip
intermediate representations (e.g. ONNX) and many nightmares;
▸ The APIs and IR opsets are very recent and subject to change, so keep
an eye on their development;
▸ We now have a serialized graph; let's find out how we can actually
execute it outside of Python. That's where ExecuTorch joins the
party!
Section VI
ExecuTorch
ExecuTorch
▸ ExecuTorch (ET) leverages PyTorch 2 compiler and export path
to enable on-device execution of PyTorch models;
▸ Portable runtime with a low memory footprint that doesn't use
TorchScript (unlike PyTorch Mobile);
▸ There is still a lot of ongoing development; this talk is aligned with
the v0.1.0 branch of ExecuTorch, a preview release for testing and
evaluation;
▸ Multiple backends (Arm, Qualcomm, XNNPACK, Apple, etc.) are being
developed, to which ExecuTorch can delegate execution on DSPs,
NPUs, CPUs, etc.;
▸ Hope to see more industry collaboration.
ExecuTorch
ExecuTorch has two main phases:
AOT (Ahead of Time)
This is the program preparation (before the execution). ExecuTorch leverages
TorchDynamo and PyTorch export to convert the model into an IR.
Optionally, backends can plug-in in this phase as well in what is called
backend delegation for AOT.
Runtime
ExecuTorch runtime executes models on the edge devices (which can be a
high-end or very constrained edge device). It will initialize, execute and
release resources. It will also initialize delegates and (surprise) delegate
execution of the program (or parts of it) to them as well.
ExecuTorch Concept Overview
Image from ExecuTorch documentation, December 2023.
ExecuTorch Lowering
ExecuTorch performs progressive lowering of the graph or parts of the
graph to different IRs, so the operations get progressively closer to the
hardware:
▸ Edge dialect: all operators come from a predefined operator set and
all inputs/outputs must be tensors;
▸ Backend dialect: the immediate result of exporting the Edge dialect to
a particular backend. It allows the introduction of target-specific
operators (which are aware of the hardware they will later run on).
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Memory Planning
Before serializing the program ( .pte file), ExecuTorch performs
memory planning. It uses size and lifespan of mutable tensors to plan
their location (offset) in fixed size memory arenas:
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Memory Planning
Before serializing the program ( .pte file), ExecuTorch performs
memory planning. It uses size and lifespan of mutable tensors to plan
their location (offset) in fixed size memory arenas:
Naive algorithm
Concatenates all the tensors together in a linear memory without considering
any memory re-use.
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Memory Planning
Before serializing the program (.pte file), ExecuTorch performs
memory planning. It uses the size and lifespan of mutable tensors to plan
their location (offset) in fixed-size memory arenas:
Naive algorithm
Concatenates all the tensors together in a linear memory without considering
any memory re-use.
Greedy algorithm
Tries to re-use already-allocated memory, choosing blocks based on a
best-fit criterion.
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Memory Planning
Before serializing the program (.pte file), ExecuTorch performs
memory planning. It uses the size and lifespan of mutable tensors to plan
their location (offset) in fixed-size memory arenas:
Naive algorithm
Concatenates all the tensors together in a linear memory without considering
any memory re-use.
Greedy algorithm
Tries to re-use already-allocated memory, choosing blocks based on a
best-fit criterion.
program = edge_program.to_executorch( # Example
exir.ExecutorchBackendConfig(
memory_planning_pass=MemoryPlanningPass(
memory_planning_algo="greedy",
# (...)
)))
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Export
Let’s export the same model that we had before:
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = torch.nn.functional.softmax(x, -1)
        return x
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Export
import torch
from torch import export
from torch._export import capture_pre_autograd_graph
from executorch.exir import to_edge

model = MLP()
model = model.eval()
inputs = (torch.randn(10, 8),)

pre_atgrad_aten_ir = capture_pre_autograd_graph(model, inputs)
aten_ir = export.export(pre_atgrad_aten_ir, inputs)
edge_ir = to_edge(aten_ir)
program = edge_ir.to_executorch()

with open("model.pte", "wb") as fhandle:
    fhandle.write(program.buffer)
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Serialization
The serialization of the program uses the same memory-efficient
format used by TensorFlow Lite: FlatBuffers. The Program schema is
defined in the schema/program.fbs file:
// (...) omitted for brevity
table Program {
// Schema version.
version:uint;
// List of ExecutionPlans that make up the program.
// Each ExecutionPlan corresponds with a different
// entry point into the model.
execution_plan:[ExecutionPlan];
// (...) omitted for brevity
}
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Serialization
Let’s see what our exported program looks like by converting the
binary FlatBuffer to JSON:
$ flatc --strict-json --raw-binary \
    -t executorch/schema/program.fbs -- ./model.pte
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Serialization
Let’s see what our exported program looks like by converting the
binary FlatBuffer to JSON:
$ flatc --strict-json --raw-binary \
    -t executorch/schema/program.fbs -- ./model.pte
$ jq ".execution_plan[0].name" model.json
"forward"
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Serialization
Let’s see what our exported program looks like by converting the
binary FlatBuffer to JSON:
$ flatc --strict-json --raw-binary \
    -t executorch/schema/program.fbs -- ./model.pte
$ jq ".execution_plan[0].name" model.json
"forward"
$ jq ".execution_plan[0].operators[].name" model.json
"aten::permute_copy"
"aten::addmm"
"aten::_softmax"
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Memory Planning in Action
Let’s see what one tensor looks like in the Program:
// (...)
"val_type": "Tensor",
"val": {
"scalar_type": "FLOAT",
"sizes": [10, 8],
"dim_order": [0, 1],
"allocation_info": {
"memory_id": 1,
"memory_offset": 800
}
}
// (...)
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Memory Planning in Action
Constant tensors (e.g. weights in a Linear layer) are handled
differently from mutable tensors:
Result<void*> getTensorDataPtr(...) {
  // Constant tensors: data lives in the constant buffer of the .pte
  if (s_tensor->constant_buffer_idx() > 0) {
    auto data =
        program->get_constant_buffer_data(
            s_tensor->constant_buffer_idx());
    return const_cast<void*>(data.get());
  }
  // Mutable tensors: resolve the planned offset inside the memory arena
  const executorch_flatbuffer::AllocationDetails* allocation_info =
      s_tensor->allocation_info();
  if (allocation_info != nullptr) {
    const uint32_t memory_id = allocation_info->memory_id() - 1;
    return allocator->get_offset_address(
        memory_id, allocation_info->memory_offset(), nbytes);
  }
  // (...)
}
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Concept Overview
Image from ExecuTorch documentation, December 2023.
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Runtime
The ExecuTorch runtime is a portable runtime:
▸ C++11 compatible, no exceptions or RTTI
▸ cmake and buck2 build support is provided
▸ The memory allocation mechanism is provided by the user; the core
runtime doesn’t do memory allocations (backend kernels might, but are
discouraged from doing so)
▸ Different memory regions can be used for mutable tensors (e.g.
SRAM/DRAM placement)
▸ Without kernels or backends, the runtime is around 50 KB
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Runtime
We now have the exported Program and want to load model.pte and
execute it on the edge device.
▸ At this point, your next steps will depend on the edge device you
want the runtime to run on;
▸ There are many examples in ExecuTorch on how to deploy using
XNNPACK, or targeting Arm (e.g. Ethos-U NPU), Qualcomm
Hexagon NPU, DSPs, building Android/iOS apps, etc;
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
ExecuTorch Runtime
We now have the exported Program and want to load model.pte and
execute it on the edge device.
▸ At this point, your next steps will depend on the edge device you
want the runtime to run on;
▸ There are many examples in ExecuTorch on how to deploy using
XNNPACK, or targeting Arm (e.g. Ethos-U NPU), Qualcomm
Hexagon NPU, DSPs, building Android/iOS apps, etc;
▸ For this tutorial, I will target a Pixel Watch 2 device (with a
Cortex-A53) and use the portable CPU kernels.
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Loading the Program
Let’s start looking at how we can use the runtime in C++ by first
loading the serialized Program:
Result<FileDataLoader> loader = FileDataLoader::from(model_path);
Result<Program> program = Program::load(&loader.get());
Result<MethodMeta> method_meta = program->method_meta("forward");
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Loading the Program
Let’s start looking at how we can use the runtime in C++ by first
loading the serialized Program:
Result<FileDataLoader> loader = FileDataLoader::from(model_path);
Result<Program> program = Program::load(&loader.get());
Result<MethodMeta> method_meta = program->method_meta("forward");
▸ The .pte file is opened
▸ File header is parsed
▸ Flatbuffer is created with serialized data
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Memory Affair
Let’s now create an allocator method_allocator for the method
structure:
static uint8_t method_allocator_pool[4 * 1024U * 1024U];
MemoryAllocator method_allocator{
MemoryAllocator(sizeof(method_allocator_pool),
method_allocator_pool)};
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Memory Affair
Let’s now create an allocator method_allocator for the method
structure:
static uint8_t method_allocator_pool[4 * 1024U * 1024U];
MemoryAllocator method_allocator{
MemoryAllocator(sizeof(method_allocator_pool),
method_allocator_pool)};
Most of this code is from executor_runner.cpp in ExecuTorch.
Don’t get too attached to its idiosyncrasies; focus on what it is actually
doing.
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Memory Affair
Let’s now allocate the planned buffers for the mutable tensors:
std::vector<std::unique_ptr<uint8_t[]>> buffers;
std::vector<Span<uint8_t>> spans;
size_t n_planned_buffers =
    method_meta->num_memory_planned_buffers();
for (size_t id = 0; id < n_planned_buffers; ++id) {
  size_t buffer_size =
      method_meta->memory_planned_buffer_size(id).get();
  buffers.push_back(std::make_unique<uint8_t[]>(buffer_size));
  spans.push_back({buffers.back().get(), buffer_size});
}
HierarchicalAllocator planned_memory({spans.data(), spans.size()});
MemoryManager memory_manager(&method_allocator, &planned_memory);
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Memory Affair
We can now finally execute the method:
Result<Method> method =
    program->load_method("forward", &memory_manager);
method->set_input(...);  // set the method inputs
Error status = method->execute();
// Get the outputs into "outputs"
std::vector<EValue> outputs(method->outputs_size());
status = method->get_outputs(outputs.data(), outputs.size());
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Our victim today
▸ Google Pixel Watch 2
▸ Qualcomm SW5100, 4x Cortex A53
cores
▸ 2GB of RAM
▸ Android Wear OS 4
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Our victim today
▸ Google Pixel Watch 2
▸ Qualcomm SW5100, 4x Cortex A53
cores
▸ 2GB of RAM
▸ Android Wear OS 4
▸ I’m not affiliated with Google; this
just happened to be the first small device
in front of me. I’m planning to
experiment with a more constrained
RP2040 (Raspberry Pi Pico,
Cortex-M0+) next time.
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Which CPU is that
The Pixel Watch 2 runs Android; let’s check the architecture:
$ uname -a
Linux localhost 5.15.104-android13-(...) armv8l Toybox
Interestingly, this SoC supports ARMv8 64-bit, but it is running in
32-bit mode with the kernel compiled for armv8l (32-bit, little endian).
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Which CPU is that
The Pixel Watch 2 runs Android; let’s check the architecture:
$ uname -a
Linux localhost 5.15.104-android13-(...) armv8l Toybox
Interestingly, this SoC supports ARMv8 64-bit, but it is running in
32-bit mode with the kernel compiled for armv8l (32-bit, little endian).
$ cat /proc/cpuinfo
processor : 0
model name : ARMv8 Processor rev 4 (v8l)
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
idiva idivt lpae evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x51
CPU architecture: 8
CPU variant : 0xa
CPU part : 0x801
CPU revision : 4
(...)
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Toolchains Everywhere
Let’s prepare to use the Android toolchain for cross-compilation:
Download the Android NDK and set its path:
$ export ANDROID_NDK=/opt/android-ndk-r26b
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Toolchains Everywhere
Let’s prepare to use the Android toolchain for cross-compilation:
Download the Android NDK and set its path:
$ export ANDROID_NDK=/opt/android-ndk-r26b
Then we just add some variables into CMakeLists.txt in
ExecuTorch:
set(CMAKE_SYSTEM_NAME Android)
set(CMAKE_SYSTEM_VERSION 24)
set(CMAKE_ANDROID_ARCH_ABI armeabi-v7a)
The only compatible architecture I found in the Android NDK was
armeabi-v7a; since armv8l is backwards compatible with ARMv7, I’m
using that one.
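Alternatively, a sketch assuming a standard NDK setup (not ExecuTorch-specific flags): the same cross-compilation can be configured by passing the NDK toolchain file on the cmake command line instead of editing CMakeLists.txt:
$ cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=armeabi-v7a \
    -DANDROID_PLATFORM=android-24 \
    (...)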
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Selective Build
There are many ways of building our application and linking to
ExecuTorch, what we will use is the selective build, which will select
only a few kernels to be compiled and we will use MobileNetV2.
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Selective Build
There are many ways of building our application and linking it to
ExecuTorch; we will use the selective build, which compiles only a
few selected kernels, and we will use MobileNetV2 as the model.
Luckily, ExecuTorch has some scripts to help with exporting the
model and compiling. Let’s export MobileNetV2 (mv2):
$ python3 -m examples.portable.scripts.export --model_name="mv2"
This will create the serialized program mv2.pte.
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Selective Build
There are many ways of building our application and linking it to
ExecuTorch; we will use the selective build, which compiles only a
few selected kernels, and we will use MobileNetV2 as the model.
Luckily, ExecuTorch has some scripts to help with exporting the
model and compiling. Let’s export MobileNetV2 (mv2):
$ python3 -m examples.portable.scripts.export --model_name="mv2"
This will create the serialized program mv2.pte.
Now we can compile it with cmake:
$ examples/selective_build/test_selective_build.sh cmake
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Selective Build
You can look at test_selective_build.sh, but the important
bit here is the list of selected ops we are building into our application:
$ cmake (...) -DEXECUTORCH_SELECT_OPS_LIST="aten::convolution.out,
(...) aten::mean.out,aten::view_copy.out,aten::permute_copy.out,
aten::addmm.out,aten::clone.out"
Instead of building all kernels, we are selecting only a few of them.
This is very important for more constrained devices.
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Selective Build
You can look at test_selective_build.sh, but the important
bit here is the list of selected ops we are building into our application:
$ cmake (...) -DEXECUTORCH_SELECT_OPS_LIST="aten::convolution.out,
(...) aten::mean.out,aten::view_copy.out,aten::permute_copy.out,
aten::addmm.out,aten::clone.out"
Instead of building all kernels, we are selecting only a few of them.
This is very important for more constrained devices.
We just copy our binary model_app and the exported model
mv2.pte to the Pixel Watch 2 using the Android adb tool and then run
the model:
$ model_app --model_path="mv2.pte"
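Spelled out, the copy-and-run step looks roughly like this (a sketch; /data/local/tmp is just one writable location where binaries can be executed, your path may differ):
$ adb push model_app /data/local/tmp/
$ adb push mv2.pte /data/local/tmp/
$ adb shell "cd /data/local/tmp && ./model_app --model_path=mv2.pte"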
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Selective Build
The output of executing the example app on the Pixel Watch 2 will be
something like this:
Output 0: tensor(sizes=[1, 1000], [
-0.50986, 0.300638, 0.0953863, 0.147721, 0.231201, 0.338555,
0.20689, -0.0575741, -0.389267, -0.0606858, -0.0213996,
-0.121034, -0.288955, 0.134052, -0.171977, -0.060362,
0.0203591, -0.0585306, 0.337859, -0.0718654, 0.490758,
0.524143, 0.197859, 0.122067, -0.35913, 0.10946, 0.347745,
0.478512, 0.226557, 0.0363519,
(...)
This shows the 1000 class logits for the input (all ones in our case).
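To make sense of these logits (a small sketch, not part of the on-device flow; the values below are just the first few numbers copied from the output above), you could map them to the top predicted ImageNet classes on the host:
import torch

# First few logits from the device output (illustrative only)
logits = torch.tensor([-0.50986, 0.300638, 0.0953863, 0.147721])
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=3)
print(top.indices, top.values)  # class indices and their probabilities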
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Thanks!
I hope you enjoyed this presentation! This was an overview of the
internals of some of the projects in the PyTorch ecosystem that came
out recently. I skipped some other important aspects, such as
distributed training, but hopefully they will appear in the next
iteration of this presentation.
Huge thanks to all PyTorch contributors!
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Section VII
; Q&A <
PyTorch 2 internals - Christian S. Perone (2023)
Tensors JIT Dynamo Inductor Torch Export ExecuTorch
Q&A
Thanks!

More Related Content

What's hot

PyTorch Python Tutorial | Deep Learning Using PyTorch | Image Classifier Usin...
PyTorch Python Tutorial | Deep Learning Using PyTorch | Image Classifier Usin...PyTorch Python Tutorial | Deep Learning Using PyTorch | Image Classifier Usin...
PyTorch Python Tutorial | Deep Learning Using PyTorch | Image Classifier Usin...
Edureka!
 
Unityネイティブプラグインマニアクス #denatechcon
Unityネイティブプラグインマニアクス #denatechconUnityネイティブプラグインマニアクス #denatechcon
Unityネイティブプラグインマニアクス #denatechcon
DeNA
 
Embulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loaderEmbulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loader
Sadayuki Furuhashi
 
async/await のしくみ
async/await のしくみasync/await のしくみ
async/await のしくみ
信之 岩永
 
MagicOnion入門
MagicOnion入門MagicOnion入門
MagicOnion入門
torisoup
 
.NET Core 3.0時代のメモリ管理
.NET Core 3.0時代のメモリ管理.NET Core 3.0時代のメモリ管理
.NET Core 3.0時代のメモリ管理
KageShiron
 
Jenkins 2を使った究極のpipeline ~ 明日もう一度来てください、本物のpipelineをお見せしますよ ~
Jenkins 2を使った究極のpipeline ~ 明日もう一度来てください、本物のpipelineをお見せしますよ ~Jenkins 2を使った究極のpipeline ~ 明日もう一度来てください、本物のpipelineをお見せしますよ ~
Jenkins 2を使った究極のpipeline ~ 明日もう一度来てください、本物のpipelineをお見せしますよ ~
ikikko
 
Accelerated Training of Transformer Models
Accelerated Training of Transformer ModelsAccelerated Training of Transformer Models
Accelerated Training of Transformer Models
Databricks
 
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システムMySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
Kouhei Sutou
 
5分で分かるgitのrefspec
5分で分かるgitのrefspec5分で分かるgitのrefspec
5分で分かるgitのrefspec
ikdysfm
 
究極のゲーム用通信プロトコルを探せ!
究極のゲーム用通信プロトコルを探せ!究極のゲーム用通信プロトコルを探せ!
究極のゲーム用通信プロトコルを探せ!
Ryosuke Otsuya
 
オープンソースライセンスの基礎と実務
オープンソースライセンスの基礎と実務オープンソースライセンスの基礎と実務
オープンソースライセンスの基礎と実務
Yutaka Kachi
 
TensorFlow XLAの可能性
TensorFlow XLAの可能性 TensorFlow XLAの可能性
TensorFlow XLAの可能性
Mr. Vengineer
 
WebRTC入門 ~沖縄編~
WebRTC入門 ~沖縄編~WebRTC入門 ~沖縄編~
WebRTC入門 ~沖縄編~
Ryosuke Otsuya
 
Photon Fusionのはじめの一歩
Photon Fusionのはじめの一歩Photon Fusionのはじめの一歩
Photon Fusionのはじめの一歩
聡 大久保
 
GKE のアプリデプロイは Spinnaker に任せて!
GKE のアプリデプロイは Spinnaker に任せて!GKE のアプリデプロイは Spinnaker に任せて!
GKE のアプリデプロイは Spinnaker に任せて!
Samir Hammoudi
 
多機能ボイチャを簡単に導入する方法
多機能ボイチャを簡単に導入する方法多機能ボイチャを簡単に導入する方法
多機能ボイチャを簡単に導入する方法
Unity Technologies Japan K.K.
 
【Unite Tokyo 2019】2DアーティストのためのGPU入門
【Unite Tokyo 2019】2DアーティストのためのGPU入門【Unite Tokyo 2019】2DアーティストのためのGPU入門
【Unite Tokyo 2019】2DアーティストのためのGPU入門
UnityTechnologiesJapan002
 
【Unite Tokyo 2019】Understanding C# Struct All Things
【Unite Tokyo 2019】Understanding C# Struct All Things【Unite Tokyo 2019】Understanding C# Struct All Things
【Unite Tokyo 2019】Understanding C# Struct All Things
UnityTechnologiesJapan002
 
Vespa 機能紹介 #yjmu
Vespa 機能紹介 #yjmuVespa 機能紹介 #yjmu

What's hot (20)

PyTorch Python Tutorial | Deep Learning Using PyTorch | Image Classifier Usin...
PyTorch Python Tutorial | Deep Learning Using PyTorch | Image Classifier Usin...PyTorch Python Tutorial | Deep Learning Using PyTorch | Image Classifier Usin...
PyTorch Python Tutorial | Deep Learning Using PyTorch | Image Classifier Usin...
 
Unityネイティブプラグインマニアクス #denatechcon
Unityネイティブプラグインマニアクス #denatechconUnityネイティブプラグインマニアクス #denatechcon
Unityネイティブプラグインマニアクス #denatechcon
 
Embulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loaderEmbulk, an open-source plugin-based parallel bulk data loader
Embulk, an open-source plugin-based parallel bulk data loader
 
async/await のしくみ
async/await のしくみasync/await のしくみ
async/await のしくみ
 
MagicOnion入門
MagicOnion入門MagicOnion入門
MagicOnion入門
 
.NET Core 3.0時代のメモリ管理
.NET Core 3.0時代のメモリ管理.NET Core 3.0時代のメモリ管理
.NET Core 3.0時代のメモリ管理
 
Jenkins 2を使った究極のpipeline ~ 明日もう一度来てください、本物のpipelineをお見せしますよ ~
Jenkins 2を使った究極のpipeline ~ 明日もう一度来てください、本物のpipelineをお見せしますよ ~Jenkins 2を使った究極のpipeline ~ 明日もう一度来てください、本物のpipelineをお見せしますよ ~
Jenkins 2を使った究極のpipeline ~ 明日もう一度来てください、本物のpipelineをお見せしますよ ~
 
Accelerated Training of Transformer Models
Accelerated Training of Transformer ModelsAccelerated Training of Transformer Models
Accelerated Training of Transformer Models
 
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システムMySQL・PostgreSQLだけで作る高速あいまい全文検索システム
MySQL・PostgreSQLだけで作る高速あいまい全文検索システム
 
5分で分かるgitのrefspec
5分で分かるgitのrefspec5分で分かるgitのrefspec
5分で分かるgitのrefspec
 
究極のゲーム用通信プロトコルを探せ!
究極のゲーム用通信プロトコルを探せ!究極のゲーム用通信プロトコルを探せ!
究極のゲーム用通信プロトコルを探せ!
 
オープンソースライセンスの基礎と実務
オープンソースライセンスの基礎と実務オープンソースライセンスの基礎と実務
オープンソースライセンスの基礎と実務
 
TensorFlow XLAの可能性
TensorFlow XLAの可能性 TensorFlow XLAの可能性
TensorFlow XLAの可能性
 
WebRTC入門 ~沖縄編~
WebRTC入門 ~沖縄編~WebRTC入門 ~沖縄編~
WebRTC入門 ~沖縄編~
 
Photon Fusionのはじめの一歩
Photon Fusionのはじめの一歩Photon Fusionのはじめの一歩
Photon Fusionのはじめの一歩
 
GKE のアプリデプロイは Spinnaker に任せて!
GKE のアプリデプロイは Spinnaker に任せて!GKE のアプリデプロイは Spinnaker に任せて!
GKE のアプリデプロイは Spinnaker に任せて!
 
多機能ボイチャを簡単に導入する方法
多機能ボイチャを簡単に導入する方法多機能ボイチャを簡単に導入する方法
多機能ボイチャを簡単に導入する方法
 
【Unite Tokyo 2019】2DアーティストのためのGPU入門
【Unite Tokyo 2019】2DアーティストのためのGPU入門【Unite Tokyo 2019】2DアーティストのためのGPU入門
【Unite Tokyo 2019】2DアーティストのためのGPU入門
 
【Unite Tokyo 2019】Understanding C# Struct All Things
【Unite Tokyo 2019】Understanding C# Struct All Things【Unite Tokyo 2019】Understanding C# Struct All Things
【Unite Tokyo 2019】Understanding C# Struct All Things
 
Vespa 機能紹介 #yjmu
Vespa 機能紹介 #yjmuVespa 機能紹介 #yjmu
Vespa 機能紹介 #yjmu
 

Similar to PyTorch 2 Internals

PyTorch under the hood
PyTorch under the hoodPyTorch under the hood
PyTorch under the hood
Christian Perone
 
ZCA: A component architecture for Python
ZCA: A component architecture for PythonZCA: A component architecture for Python
ZCA: A component architecture for PythonTimo Stollenwerk
 
PyTorch 튜토리얼 (Touch to PyTorch)
PyTorch 튜토리얼 (Touch to PyTorch)PyTorch 튜토리얼 (Touch to PyTorch)
PyTorch 튜토리얼 (Touch to PyTorch)
Hansol Kang
 
MTPLs
MTPLsMTPLs
MTPLs
miso_uam
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017
Yu-Hsun (lymanblue) Lin
 
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
Yu-Hsun (lymanblue) Lin
 
Pytroch-basic.pptx
Pytroch-basic.pptxPytroch-basic.pptx
Pytroch-basic.pptx
rebeen4
 
PyTorch for Deep Learning Practitioners
PyTorch for Deep Learning PractitionersPyTorch for Deep Learning Practitioners
PyTorch for Deep Learning Practitioners
Bayu Aldi Yansyah
 
Magical float repr
Magical float reprMagical float repr
Magical float reprdickinsm
 
The str/bytes nightmare before python2 EOL
The str/bytes nightmare before python2 EOLThe str/bytes nightmare before python2 EOL
The str/bytes nightmare before python2 EOL
Kir Chou
 
Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...
Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...
Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...
Takahiro Katagiri
 
Introduction to Python.Net
Introduction to Python.NetIntroduction to Python.Net
Introduction to Python.Net
Stefan Schukat
 
Libraries
LibrariesLibraries
Time and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxTime and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptx
dudelover
 
Permute
PermutePermute
Machine learning with py torch
Machine learning with py torchMachine learning with py torch
Machine learning with py torch
Riza Fahmi
 
Permute
PermutePermute
Designing Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.ProtoDesigning Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.Proto
Joel Falcou
 

Similar to PyTorch 2 Internals (20)

PyTorch under the hood
PyTorch under the hoodPyTorch under the hood
PyTorch under the hood
 
ZCA: A component architecture for Python
ZCA: A component architecture for PythonZCA: A component architecture for Python
ZCA: A component architecture for Python
 
tokyotalk
tokyotalktokyotalk
tokyotalk
 
PyTorch 튜토리얼 (Touch to PyTorch)
PyTorch 튜토리얼 (Touch to PyTorch)PyTorch 튜토리얼 (Touch to PyTorch)
PyTorch 튜토리얼 (Touch to PyTorch)
 
MTPLs
MTPLsMTPLs
MTPLs
 
PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017PyTorch Tutorial for NTU Machine Learing Course 2017
PyTorch Tutorial for NTU Machine Learing Course 2017
 
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
[Update] PyTorch Tutorial for NTU Machine Learing Course 2017
 
Pytroch-basic.pptx
Pytroch-basic.pptxPytroch-basic.pptx
Pytroch-basic.pptx
 
PyTorch for Deep Learning Practitioners
PyTorch for Deep Learning PractitionersPyTorch for Deep Learning Practitioners
PyTorch for Deep Learning Practitioners
 
Magical float repr
Magical float reprMagical float repr
Magical float repr
 
The str/bytes nightmare before python2 EOL
The str/bytes nightmare before python2 EOLThe str/bytes nightmare before python2 EOL
The str/bytes nightmare before python2 EOL
 
Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...
Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...
Towards Auto-tuning Facilities into Supercomputers in Operation - The FIBER a...
 
Introduction to Python.Net
Introduction to Python.NetIntroduction to Python.Net
Introduction to Python.Net
 
Libraries
LibrariesLibraries
Libraries
 
Time and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxTime and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptx
 
Permute
PermutePermute
Permute
 
Machine learning with py torch
Machine learning with py torchMachine learning with py torch
Machine learning with py torch
 
Permute
PermutePermute
Permute
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Designing Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.ProtoDesigning Architecture-aware Library using Boost.Proto
Designing Architecture-aware Library using Boost.Proto
 

More from Christian Perone

Gradient-based optimization for Deep Learning: a short introduction
Gradient-based optimization for Deep Learning: a short introductionGradient-based optimization for Deep Learning: a short introduction
Gradient-based optimization for Deep Learning: a short introduction
Christian Perone
 
Bayesian modelling for COVID-19 seroprevalence studies
Bayesian modelling for COVID-19 seroprevalence studiesBayesian modelling for COVID-19 seroprevalence studies
Bayesian modelling for COVID-19 seroprevalence studies
Christian Perone
 
Uncertainty Estimation in Deep Learning
Uncertainty Estimation in Deep LearningUncertainty Estimation in Deep Learning
Uncertainty Estimation in Deep Learning
Christian Perone
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
Christian Perone
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Christian Perone
 
Deep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural ZooDeep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural Zoo
Christian Perone
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
Christian Perone
 
Machine Learning com Python e Scikit-learn
Machine Learning com Python e Scikit-learnMachine Learning com Python e Scikit-learn
Machine Learning com Python e Scikit-learn
Christian Perone
 
Python - Introdução Básica
Python - Introdução BásicaPython - Introdução Básica
Python - Introdução BásicaChristian Perone
 
C++0x :: Introduction to some amazing features
C++0x :: Introduction to some amazing featuresC++0x :: Introduction to some amazing features
C++0x :: Introduction to some amazing features
Christian Perone
 

More from Christian Perone (10)

Gradient-based optimization for Deep Learning: a short introduction
Gradient-based optimization for Deep Learning: a short introductionGradient-based optimization for Deep Learning: a short introduction
Gradient-based optimization for Deep Learning: a short introduction
 
Bayesian modelling for COVID-19 seroprevalence studies
Bayesian modelling for COVID-19 seroprevalence studiesBayesian modelling for COVID-19 seroprevalence studies
Bayesian modelling for COVID-19 seroprevalence studies
 
Uncertainty Estimation in Deep Learning
Uncertainty Estimation in Deep LearningUncertainty Estimation in Deep Learning
Uncertainty Estimation in Deep Learning
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and PythonApache Spark - Intro to Large-scale recommendations with Apache Spark and Python
Apache Spark - Intro to Large-scale recommendations with Apache Spark and Python
 
Deep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural ZooDeep Learning - Convolutional Neural Networks - Architectural Zoo
Deep Learning - Convolutional Neural Networks - Architectural Zoo
 
Deep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural NetworksDeep Learning - Convolutional Neural Networks
Deep Learning - Convolutional Neural Networks
 
Machine Learning com Python e Scikit-learn
Machine Learning com Python e Scikit-learnMachine Learning com Python e Scikit-learn
Machine Learning com Python e Scikit-learn
 
Python - Introdução Básica
Python - Introdução BásicaPython - Introdução Básica
Python - Introdução Básica
 
C++0x :: Introduction to some amazing features
C++0x :: Introduction to some amazing featuresC++0x :: Introduction to some amazing features
C++0x :: Introduction to some amazing features
 

Recently uploaded

Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
Shane Coughlan
 
AI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website CreatorAI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website Creator
Google
 
Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
abdulrafaychaudhry
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Enterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptxEnterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptx
QuickwayInfoSystems3
 
Nidhi Software Price. Fact , Costs, Tips
Nidhi Software Price. Fact , Costs, TipsNidhi Software Price. Fact , Costs, Tips
Nidhi Software Price. Fact , Costs, Tips
vrstrong314
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
Aftab Hussain
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
abdulrafaychaudhry
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
Ortus Solutions, Corp
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
ShamsuddeenMuhammadA
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 

Recently uploaded (20)

Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
openEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain SecurityopenEuler Case Study - The Journey to Supply Chain Security
openEuler Case Study - The Journey to Supply Chain Security
 
AI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website CreatorAI Genie Review: World’s First Open AI WordPress Website Creator
AI Genie Review: World’s First Open AI WordPress Website Creator
 
Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)Introduction to Pygame (Lecture 7 Python Game Development)
Introduction to Pygame (Lecture 7 Python Game Development)
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Enterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptxEnterprise Software Development with No Code Solutions.pptx
Enterprise Software Development with No Code Solutions.pptx
 
Nidhi Software Price. Fact , Costs, Tips
Nidhi Software Price. Fact , Costs, TipsNidhi Software Price. Fact , Costs, Tips
Nidhi Software Price. Fact , Costs, Tips
 
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket ManagementUtilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
Utilocate provides Smarter, Better, Faster, Safer Locate Ticket Management
 
Vitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdfVitthal Shirke Java Microservices Resume.pdf
Vitthal Shirke Java Microservices Resume.pdf
 
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeA Study of Variable-Role-based Feature Enrichment in Neural Models of Code
A Study of Variable-Role-based Feature Enrichment in Neural Models of Code
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Pro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp BookPro Unity Game Development with C-sharp Book
Pro Unity Game Development with C-sharp Book
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptxText-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
Text-Summarization-of-Breaking-News-Using-Fine-tuning-BART-Model.pptx
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 

PyTorch 2 Internals

  • 1. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch PyTorch 2 internals A not so short guide to recent PyTorch innovations Christian S. Perone (christian.perone@gmail.com) http://blog.christianperone.com London, UK, Dec 2023
  • 2. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Who Am I ▸ Christian S. Perone ▸ ML Research Engineer in London/UK ▸ Blog at ▸ blog.christianperone.com ▸ Open-source projects at ▸ https://github.com/perone ▸ Twitter @tarantulae
  • 3. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Disclaimer PyTorch development pace is so fast that no man ever steps in PyTorch code twice, for it’s not the same code and he’s not the same man. —Heraclitus, 500 BC
  • 4. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Section I ; Tensors <
  • 5. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensors Simply put, tensors are a generalization of vectors and matrices. In PyTorch, they are a multi-dimensional matrix containing elements of a single data type.
  • 6. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensors Simply put, tensors are a generalization of vectors and matrices. In PyTorch, they are a multi-dimensional matrix containing elements of a single data type. >>> import torch >>> t = torch.tensor([[1., -1.], [1., -1.]]) >>> t tensor([[ 1., -1.] [ 1., -1.]])
  • 7. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensors Simply put, tensors are a generalization of vectors and matrices. In PyTorch, they are a multi-dimensional matrix containing elements of a single data type. >>> import torch >>> t = torch.tensor([[1., -1.], [1., -1.]]) >>> t tensor([[ 1., -1.] [ 1., -1.]]) >>> t.dtype # They have a type torch.float32
  • 8. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensors Simply put, tensors are a generalization of vectors and matrices. In PyTorch, they are a multi-dimensional matrix containing elements of a single data type. >>> import torch >>> t = torch.tensor([[1., -1.], [1., -1.]]) >>> t tensor([[ 1., -1.] [ 1., -1.]]) >>> t.dtype # They have a type torch.float32 >>> t.shape # a shape torch.Size([2, 2])
  • 9. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensors Simply put, tensors are a generalization of vectors and matrices. In PyTorch, they are a multi-dimensional matrix containing elements of a single data type. >>> import torch >>> t = torch.tensor([[1., -1.], [1., -1.]]) >>> t tensor([[ 1., -1.] [ 1., -1.]]) >>> t.dtype # They have a type torch.float32 >>> t.shape # a shape torch.Size([2, 2]) >>> t.device # and live in some device device(type='cpu')
  • 10. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensors ▸ Although PyTorch has an elegant python first design, all PyTorch heavy work is actually implemented in C++. ▸ In Python, the integration of C++ code is (usually) done using what is called an extension;
  • 11. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensors ▸ Although PyTorch has an elegant python first design, all PyTorch heavy work is actually implemented in C++. ▸ In Python, the integration of C++ code is (usually) done using what is called an extension; ▸ PyTorch uses ATen, which is the foundational tensor operation library on which all else is built;
  • 12. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensors ▸ Although PyTorch has an elegant python first design, all PyTorch heavy work is actually implemented in C++. ▸ In Python, the integration of C++ code is (usually) done using what is called an extension; ▸ PyTorch uses ATen, which is the foundational tensor operation library on which all else is built; ▸ To do automatic differentiation, PyTorch uses Autograd, which is an augmentation on top of the ATen framework;
  • 13. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensors ▸ Although PyTorch has an elegant python first design, all PyTorch heavy work is actually implemented in C++. ▸ In Python, the integration of C++ code is (usually) done using what is called an extension; ▸ PyTorch uses ATen, which is the foundational tensor operation library on which all else is built; ▸ To do automatic differentiation, PyTorch uses Autograd, which is an augmentation on top of the ATen framework;
  • 14. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Quick recap Python objects typedef struct { PyObject_HEAD double ob_fval; } PyFloatObject;
  • 15. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Quick recap Python objects typedef struct { PyObject_HEAD double ob_fval; } PyFloatObject; typedef struct _object { Py_ssize_t ob_refcnt; struct _typeobject *ob_type; } PyObject;
  • 16. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Quick recap Python objects typedef struct { PyObject_HEAD double ob_fval; } PyFloatObject; typedef struct _object { Py_ssize_t ob_refcnt; struct _typeobject *ob_type; } PyObject; struct _typeobject *ob_type Py_ssize_t ob_refcnt object PyObject double ob_fval PyObject_HEAD object PyFloatObject
  • 17. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Quick recap Python objects struct THPVariable { PyObject_HEAD; c10::MaybeOwned<at::Tensor> cdata; PyObject* backward_hooks = nullptr; PyObject* post_accumulate_grad_hooks = nullptr; }; The TH prefix is from TorcH, and P means Python.
  • 18. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Quick recap Python objects struct THPVariable { PyObject_HEAD; c10::MaybeOwned<at::Tensor> cdata; PyObject* backward_hooks = nullptr; PyObject* post_accumulate_grad_hooks = nullptr; }; (object fields) PyObject_HEAD (w/ ref counter) object THPVariable variable_a variable_b Ref Count = 1 Ref Count = 2 The TH prefix is from TorcH, and P means Python.
  • 19. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch In Python, everything is an object >>> a = 300 >>> b = 300 >>> a is b False
  • 20. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch In Python, everything is an object >>> a = 300 >>> b = 300 >>> a is b False >>> a = 200 >>> b = 200 >>> a is b True
  • 21. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch In Python, everything is an object >>> a = 300 >>> b = 300 >>> a is b False >>> a = 200 >>> b = 200 >>> a is b True (object fields) PyObject_HEAD object PyIntObject a b Ref Count = 1 Ref Count = 2 (object fields) PyObject_HEAD object PyIntObject (object fields) PyObject_HEAD object PyIntObject a b Ref Count = 1 Ref Count = 1 A typical Python program spend much of its time allocating/deallocating integers. CPython then caches the small integers.
  • 22. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors It is very common to load tensors in numpy and convert them to PyTorch, or vice-versa; >>> np_array = np.ones((2,2)) >>> np_array array([[1., 1.], [1., 1.]]) Underline after an operation means an in-place operation.
  • 23. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors It is very common to load tensors in numpy and convert them to PyTorch, or vice-versa; >>> np_array = np.ones((2,2)) >>> np_array array([[1., 1.], [1., 1.]]) >>> torch_array = torch.tensor(np_array) >>> torch_array tensor([[1., 1.], [1., 1.]], dtype=torch.float64) Underline after an operation means an in-place operation.
  • 24. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors It is very common to load tensors in numpy and convert them to PyTorch, or vice-versa; >>> np_array = np.ones((2,2)) >>> np_array array([[1., 1.], [1., 1.]]) >>> torch_array = torch.tensor(np_array) >>> torch_array tensor([[1., 1.], [1., 1.]], dtype=torch.float64) >>> torch_array.add_(1.0) Underline after an operation means an in-place operation.
  • 25. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors It is very common to load tensors in numpy and convert them to PyTorch, or vice-versa; >>> np_array = np.ones((2,2)) >>> np_array array([[1., 1.], [1., 1.]]) >>> torch_array = torch.tensor(np_array) >>> torch_array tensor([[1., 1.], [1., 1.]], dtype=torch.float64) >>> torch_array.add_(1.0) >>> np_array array([[1., 1.], # array is intact, a copy was made [1., 1.]]) Underline after an operation means an in-place operation.
  • 26. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors ▸ Now imagine that you have a batch of 128 images, 3 channels each (RGB) and with size of 224x224; 0 1 1 1 1 0 0 1 0 1 0 0 0 1 1 0 1 0 1 1 1 1 1 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 0 1 1 1 1 1 1 1 0 Column Row Channel ▸ This will yield a size in memory of ∼ 74MB. We don’t want to duplicate memory (except when copying them to discrete GPUs of course);
  • 27. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors Let’s see now a slightly different code using the function torch.from_numpy() this time: >>> np_array array([[1., 1.], [1., 1.]]) >>> torch_array = torch.from_numpy(np_array)
  • 28. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors Let’s see now a slightly different code using the function torch.from_numpy() this time: >>> np_array array([[1., 1.], [1., 1.]]) >>> torch_array = torch.from_numpy(np_array) >>> torch_array.add_(1.0)
  • 29. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors Let’s see now a slightly different code using the function torch.from_numpy() this time: >>> np_array array([[1., 1.], [1., 1.]]) >>> torch_array = torch.from_numpy(np_array) >>> torch_array.add_(1.0) >>> np_array array([[2., 2.], [2., 2.]])
  • 30. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors Let’s see now a slightly different code using the function torch.from_numpy() this time: >>> np_array array([[1., 1.], [1., 1.]]) >>> torch_array = torch.from_numpy(np_array) >>> torch_array.add_(1.0) >>> np_array array([[2., 2.], [2., 2.]]) The original numpy array was changed, because it used a zero-copy operation.
  • 31. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors Difference between in-place and standard operations might not be so clear in some cases: >>> np_array array([[1., 1.], [1., 1.]]) >>> torch_array = torch.from_numpy(np_array)
  • 32. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors Difference between in-place and standard operations might not be so clear in some cases: >>> np_array array([[1., 1.], [1., 1.]]) >>> torch_array = torch.from_numpy(np_array) >>> np_array = np_array + 1.0
  • 33. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors Difference between in-place and standard operations might not be so clear in some cases: >>> np_array array([[1., 1.], [1., 1.]]) >>> torch_array = torch.from_numpy(np_array) >>> np_array = np_array + 1.0 >>> torch_array tensor([[1., 1.], [1., 1.]], dtype=torch.float64)
  • 34. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors Difference between in-place and standard operations might not be so clear in some cases: >>> np_array array([[1., 1.], [1., 1.]]) >>> torch_array = torch.from_numpy(np_array) >>> np_array = np_array + 1.0 >>> torch_array tensor([[1., 1.], [1., 1.]], dtype=torch.float64) However, if you use np_array += 1.0 , that is an in-place operation that will change torch_array memory.
  • 35. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Zero-copying tensors at::Tensor tensor_from_numpy(PyObject* obj, (omitted)) { // some parts omitted for brevity auto array = (PyArrayObject*)obj; int ndim = PyArray_NDIM(array); auto sizes = to_aten_shape(ndim, PyArray_DIMS(array)); auto strides = to_aten_shape(ndim, PyArray_STRIDES(array)); void* data_ptr = PyArray_DATA(array); Py_INCREF(obj); return at::lift_fresh(at::from_blob( data_ptr, sizes, strides, [obj](void* data) { pybind11::gil_scoped_acquire gil; Py_DECREF(obj); }, at::device(kCPU).dtype(numpy_dtype_to_aten(PyArray_TYPE(array))) } Pay attention to the reference counting using Py_INCREF() and the call to at::from_blob() function.
  • 36. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Data pointers (object fields) data_pointer* object PyArrayObject (object fields) data_pointer* object FloatTensor The tensor FloatTensor did a copy of the numpy array data pointer and not of the contents. The reference is kept safe by the Python reference counting mechanism.
  • 37. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensor Storage The abstraction responsible for holding the data isn’t actually the Tensor , but the Storage .
  • 38. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensor Storage The abstraction responsible for holding the data isn’t actually the Tensor , but the Storage . struct C10_API StorageImpl : public c10::intrusive_ptr_target { // (...) private: // (...) DataPtr data_ptr_; SymInt size_bytes_; Allocator* allocator_; // (...) }
  • 39. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensor Storage The abstraction responsible for holding the data isn’t actually the Tensor , but the Storage . struct C10_API StorageImpl : public c10::intrusive_ptr_target { // (...) private: // (...) DataPtr data_ptr_; SymInt size_bytes_; Allocator* allocator_; // (...) } ▸ Holds a pointer to the raw data and contains information such as the size and allocator; ▸ Storage is a dumb abstraction, there is no metadata telling us how to interpret the data it holds;
  • 40. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensor Storage ▸ The Storage abstraction is very powerful because it decouples the raw data and how we can interpret it;
  • 41. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensor Storage ▸ The Storage abstraction is very powerful because it decouples the raw data and how we can interpret it; ▸ We can have multiple tensors sharing the same storage, but with different interpretations, also called views, but without duplicating memory:
  • 42. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensor Storage ▸ The Storage abstraction is very powerful because it decouples the raw data and how we can interpret it; ▸ We can have multiple tensors sharing the same storage, but with different interpretations, also called views, but without duplicating memory: >>> x = torch.ones((2, 2)) >>> x_view = x.view(4) >>> x_data = x.untyped_storage().data_ptr() >>> x_view_data = x_view.untyped_storage().data_ptr() >>> x_data == x_view_data True
  • 43. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tensor Storage ▸ The Storage abstraction is very powerful because it decouples the raw data and how we can interpret it; ▸ We can have multiple tensors sharing the same storage, but with different interpretations, also called views, but without duplicating memory: >>> x = torch.ones((2, 2)) >>> x_view = x.view(4) >>> x_data = x.untyped_storage().data_ptr() >>> x_view_data = x_view.untyped_storage().data_ptr() >>> x_data == x_view_data True ▸ x_view is a different view (interpretation) of the same data present in the underlying storage that is shared between both tensors.
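Since the storage is shared, a write through one view is visible through the other (continuing the example above):

    x = torch.ones(2, 2)
    x_view = x.view(4)
    x_view[0] = 42.0     # writes into the shared storage
    print(x[0, 0])       # tensor(42.): the original tensor sees the change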
  • 44. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Memory allocators (CPU/GPU) ▸ The tensor storage can be allocated either in the CPU memory or GPU, therefore a mechanism is required to switch between these different allocations:
  • 45. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Memory allocators (CPU/GPU) ▸ The tensor storage can be allocated either in the CPU memory or GPU, therefore a mechanism is required to switch between these different allocations: struct C10_API Allocator { virtual ~Allocator() = default; virtual DataPtr allocate(size_t n) const = 0; virtual DeleterFnPtr raw_deleter() const {...} void* raw_allocate(size_t n) {...} void raw_deallocate(void* ptr) {...} }; ▸ There are Allocator s that will use the GPU allocators such as cudaMalloc() when the storage should be used for the GPU or posix_memalign() POSIX functions for data in the CPU memory.
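From Python you never touch the Allocator directly; it is picked based on the device the storage lives on. A small sketch (the CUDA branch assumes a GPU is available):

    import torch

    x_cpu = torch.ones(2, 2)            # storage allocated by the CPU allocator
    print(x_cpu.device)                 # cpu

    if torch.cuda.is_available():
        x_gpu = x_cpu.to("cuda")        # new storage from the CUDA caching allocator
        print(x_gpu.device)             # cuda:0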
  • 46. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch CUDA caching allocator PyTorch uses a CUDA caching allocator that maintains a cache of allocations with the Block structure:
  • 47. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch CUDA caching allocator PyTorch uses a CUDA caching allocator that maintains a cache of allocations with the Block structure: struct Block { int device; // gpu cudaStream_t stream; // allocation stream size_t size; // block size in bytes BlockPool* pool{nullptr}; // owning memory pool void* ptr{nullptr}; // memory address bool allocated{false}; // in-use flag Block* prev{nullptr}; // prev block if split from a Block* next{nullptr}; // next block if split from a // (...) } The torch.cuda.empty_cache() will release all unused blocks.
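The effect of the caching allocator can be observed with the torch.cuda memory statistics (a sketch that assumes a CUDA device is present):

    import torch

    if torch.cuda.is_available():
        x = torch.ones(1024, 1024, device="cuda")
        print(torch.cuda.memory_allocated())   # bytes currently used by live tensors
        print(torch.cuda.memory_reserved())    # bytes held (cached) by the allocator
        del x
        torch.cuda.empty_cache()               # releases unused cached blocks to the driver
        print(torch.cuda.memory_reserved())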
  • 48. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch The big picture (object fields) Storage *storage object Tensor Allocator *allocator (object fields) DataPtr data_ptr object Storage raw_deallocate() (object fields) raw_allocate() object Allocator Raw Data ▸ The Tensor has a Storage which in turn has a pointer to the raw data and to the Allocator to allocate memory according to the destination device.
  • 49. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Section II ; JIT <
  • 50. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch JIT - Just-in-time compiler ▸ PyTorch is eager by design, which means that it is easily hackable to debug, inspect, etc;
  • 51. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch JIT - Just-in-time compiler ▸ PyTorch is eager by design, which means that it is easily hackable to debug, inspect, etc; ▸ However, this poses problems for optimization and for decoupling it from Python (the model itself is Python code);
  • 52. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch JIT - Just-in-time compiler ▸ PyTorch is eager by design, which means that it is easily hackable to debug, inspect, etc; ▸ However, this poses problems for optimization and for decoupling it from Python (the model itself is Python code); ▸ PyTorch 1.0 introduced torch.jit , which has two main methods to convert a PyTorch model to a serializable and optimizable format; ▸ TorchScript was also introduced as a statically-typed subset of Python;
  • 53. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch JIT - Just-in-time compiler Two very different worlds with their own requirements. Prototype, debug, train, experiment EAGER MODE Optimization, other languages, deployment SCRIPT MODE ! " # tracing scripting
  • 54. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tracing def my_function(x): if x.mean() > 1.0: r = torch.tensor(1.0) else: r = torch.tensor(2.0) return r
  • 55. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tracing def my_function(x): if x.mean() > 1.0: r = torch.tensor(1.0) else: r = torch.tensor(2.0) return r >>> ftrace = torch.jit.trace(my_function, (torch.ones(2, 2)))
  • 56. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tracing def my_function(x): if x.mean() > 1.0: r = torch.tensor(1.0) else: r = torch.tensor(2.0) return r >>> ftrace = torch.jit.trace(my_function, (torch.ones(2, 2))) >>> ftrace.graph graph(%x : Float(2, 2, strides=[2, 1], requires_grad=0, device=cpu)): %5 : Float(requires_grad=0, device=cpu) = prim::Constant[value={2}]() %6 : Device = prim::Constant[value="cpu"]() %7 : int = prim::Constant[value=6]() %8 : bool = prim::Constant[value=0]() %9 : bool = prim::Constant[value=0]() %10 : NoneType = prim::Constant() %11 : Float(requires_grad=0, device=cpu) = aten::to(%5, %6, %7, %8, %9, %12 : Float(requires_grad=0, device=cpu) = aten::detach(%11) return (%12)
  • 57. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tracing To call the JIT’ed function, just call the forward() method: >>> x = torch.ones(2, 2) >>> ftrace.forward(x) tensor(2.)
  • 58. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Tracing To call the JIT’ed function, just call the forward() method: >>> x = torch.ones(2, 2) >>> ftrace.forward(x) tensor(2.) However, tracing will not record any control flow such as if statements or loops; it executes the code with the given inputs and records the resulting graph. You can see this limitation below: >>> x = torch.ones(2, 2).add_(1.0) >>> ftrace.forward(x) tensor(2.) According to my_function() , the result should have been 1.0. Tracing also checks for differences between the traced and the Python function, but what about Dropout ?
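A hedged sketch of that checking mechanism: torch.jit.trace accepts extra check_inputs, and re-checking the trace with an input that takes the other branch makes the divergence visible (the exact warning/error text varies by release):

    ftrace = torch.jit.trace(
        my_function,                             # the function defined above
        (torch.ones(2, 2),),                     # inputs used to record the trace
        check_inputs=[(torch.ones(2, 2) * 3,)],  # mean > 1.0: takes the other branch
    )
    # PyTorch re-runs the check and reports that the traced graph and the Python
    # function diverge for these inputs (traced returns 2.0, Python returns 1.0).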
  • 59. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Scripting Another alternative is to use scripting, where you can use decorators such as @torch.jit.script : @torch.jit.script def my_function(x): if bool(x.mean() > 1.0): r = 1 else: r = 2 return r
  • 60. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Scripting >>> my_function.graph graph(%x.1 : Tensor): %2 : NoneType = prim::Constant() %4 : float = prim::Constant[value=1.]() %9 : int = prim::Constant[value=1]() %10 : int = prim::Constant[value=2]() %3 : Tensor = aten::mean(%x.1, %2) %5 : Tensor = aten::gt(%3, %4) %7 : bool = aten::Bool(%5) %r : int = prim::If(%7) block0(): -> (%9) block1(): -> (%10) return (%r)
  • 61. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Scripting The my_function() is now a torch.jit.ScriptFunction : >>> type(my_function) torch.jit.ScriptFunction
  • 62. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Scripting The my_function() is now a torch.jit.ScriptFunction : >>> type(my_function) torch.jit.ScriptFunction When we check the results again: >>> x = torch.ones(2, 2) >>> my_function(x) 2
  • 63. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Scripting The my_function() is now a torch.jit.ScriptFunction : >>> type(my_function) torch.jit.ScriptFunction When we check the results again: >>> x = torch.ones(2, 2) >>> my_function(x) 2 >>> x = torch.ones(2, 2).add_(1.0) >>> my_function(x) 1 Control-flow logic was preserved !
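Scripting also works on nn.Module instances, compiling forward() and submodules recursively. A minimal sketch (the MyModule class here is hypothetical, just for illustration):

    class MyModule(torch.nn.Module):
        def forward(self, x):
            if bool(x.sum() > 0):
                return x * 2
            return x - 2

    scripted = torch.jit.script(MyModule())   # a ScriptModule with the IR attached
    print(scripted(torch.ones(2)))            # tensor([2., 2.]): control flow preserved
    print(scripted.graph)                     # the TorchScript IR of forward()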
  • 64. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Why TorchScript ? ▸ The concept of having a well-defined Intermediate Representation (IR) is very powerful; it’s the main concept behind the LLVM platform as well;
  • 65. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Why TorchScript ? ▸ The concept of having a well-defined Intermediate Representation (IR) is very powerful; it’s the main concept behind the LLVM platform as well; ▸ This opens the door to: ▸ Decouple the model (computational graph) from the Python runtime;
  • 66. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Why TorchScript ? ▸ The concept of having a well-defined Intermediate Representation (IR) is very powerful; it’s the main concept behind the LLVM platform as well; ▸ This opens the door to: ▸ Decouple the model (computational graph) from the Python runtime; ▸ Use it in production with C++ (no GIL) or other languages;
  • 67. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Why TorchScript ? ▸ The concept of having a well-defined Intermediate Representation (IR) is very powerful; it’s the main concept behind the LLVM platform as well; ▸ This opens the door to: ▸ Decouple the model (computational graph) from the Python runtime; ▸ Use it in production with C++ (no GIL) or other languages; ▸ Capitalize on optimizations (whole program);
  • 68. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Why TorchScript ? ▸ The concept of having a well-defined Intermediate Representation (IR) is very powerful; it’s the main concept behind the LLVM platform as well; ▸ This opens the door to: ▸ Decouple the model (computational graph) from the Python runtime; ▸ Use it in production with C++ (no GIL) or other languages; ▸ Capitalize on optimizations (whole program); ▸ Split the development world of hackable and easy-to-debug code from the world of putting these models in production and optimizing them.
  • 69. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Building the IR To build the IR, PyTorch leverages the Python Abstract Syntax Tree (AST), which is a tree representation of the syntactic structure of the source code. >>> ast_mod = ast.parse("print(1 + 2)") >>> astpretty.pprint(ast_mod.body[0], show_offsets=False) Expr( value=Call( func=Name(id='print', ctx=Load()), args=[ BinOp( left=Num(n=1), op=Add(), right=Num(n=2), ), ], keywords=[], ), )
  • 70. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Building the IR print(1 + 2)
  • 71. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch PyTorch JIT Phases Parsing ! Checking " Optimization # Translation $ Execution ○ & AST ' Code or
  • 72. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Optimizations Many optimizations can be used on the computational graph of the model, such as Loop Unrolling: for i.. i+= 1 for i.. i+= 4 for j.. for j.. code(i, j) code(i, j) code(i+1, j) code(i+2, j) code(i+3, j) remainder loop
  • 73. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Optimizations Also Peephole optimizations such as: x.t().t() = x
  • 74. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Optimizations Also Peephole optimizations such as: x.t().t() = x Example: def dumb_function(x): return x.t().t() >>> traced_fn = torch.jit.trace(dumb_function, ... torch.ones(2,2)) >>> traced_fn.graph_for(torch.ones(2,2)) graph(%x : Tensor): return (%x)
  • 75. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Optimizations Also Peephole optimizations such as: x.t().t() = x Example: def dumb_function(x): return x.t().t() >>> traced_fn = torch.jit.trace(dumb_function, ... torch.ones(2,2)) >>> traced_fn.graph_for(torch.ones(2,2)) graph(%x : Tensor): return (%x) Other optimizations include Constant Propagation, Dead Code Elimination (DCE), fusion, inlining, etc.
  • 76. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Serialization >>> resnet = torch.jit.trace(models.resnet18(), ... torch.rand(1, 3, 224, 224)) >>> resnet.save("resnet.pt")
  • 77. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Serialization >>> resnet = torch.jit.trace(models.resnet18(), ... torch.rand(1, 3, 224, 224)) >>> resnet.save("resnet.pt") $ file resnet.pt resnet.pt: Zip archive data
  • 78. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Serialization >>> resnet = torch.jit.trace(models.resnet18(), ... torch.rand(1, 3, 224, 224)) >>> resnet.save("resnet.pt") $ file resnet.pt resnet.pt: Zip archive data $ unzip resnet.pt Archive: resnet.pt extracting: resnet/version extracting: resnet/code/__torch__/torchvision/models/resnet extracting: resnet/data/0 (...)
  • 79. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Serialization code/resnet.py def forward(self: (...) resnet.ResNet, x: Tensor) -> Tensor: # (...) _0 = (bn1).forward((conv1).forward(x, ), ) _1 = (maxpool).forward((relu).forward(_0, ), ) _2 = (layer2).forward((layer1).forward(_1, ), ) _3 = (layer4).forward((layer3).forward(_2, ), ) input = torch.flatten((avgpool).forward(_3, ), 1) return (fc).forward(input, )
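The same archive can be loaded back in Python with torch.jit.load before we move to the C++ side (a small sketch):

    loaded = torch.jit.load("resnet.pt")      # deserializes code + weights
    out = loaded(torch.rand(1, 3, 224, 224))
    print(out.shape)                          # torch.Size([1, 1000])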
  • 80. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Using the model in C++ In the example below we load the exported TorchScript model and run forward() using Torch’s C++ API. Example of loading a traced model in the PyTorch C++ API: #include <torch/script.h> int main(int argc, const char* argv[]) { auto module = torch::jit::load("resnet.pt"); std::vector<torch::jit::IValue> inputs; inputs.push_back(torch::ones({1, 3, 224, 224})); at::Tensor output = module.forward(inputs).toTensor(); }
  • 81. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Executing Just like Python interpreter executes your code, PyTorch has an interpreter that executes the IR instructions: bool runImpl(Stack& stack) { // (...) omitted try { while (true) { Frame& frame = frames.back(); Instruction inst = INST_FETCH(0); switch (inst.op) { case INST(ENTER): { INST_GUARD; const auto& obj = peek(stack, 0, 1); TORCH_INTERNAL_ASSERT(obj.isObject()); entered_objects.push_back(obj); } INST_NEXT; // (...) omitted
  • 82. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Section III ; Dynamo <
  • 83. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Python Stack Frames Conceptually, an interpreter executes instructions within a context, which we refer to as frames.
  • 84. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Python Stack Frames Conceptually, an interpreter executes instructions within a context, which we refer to as frames. A function call generates a new frame, which is cleared when the function returns. This process is facilitated by a stack, with the frames being placed in order, thus giving rise to the term stack frames. Global Frame add add sub a = 1 b = 1 ret = 2 sub a = 2 b = 4 ret = -2 add a = 2 b = 2 ret = 4 function add(a, b) function sub(a, b)
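These frames are ordinary Python objects that can be inspected at runtime, e.g. with the inspect module (a sketch):

    import inspect

    def add(a, b):
        frame = inspect.currentframe()        # the frame created for this call
        print(frame.f_code.co_name)           # 'add'
        print(frame.f_back.f_code.co_name)    # the caller's frame (here: '<module>')
        return a + b

    add(1, 1)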
  • 85. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch CPython Frame Evaluation Frame evaluation in CPython happens in _PyEval_EvalFrameDefault function. This is where the core of Python execution is, all bytecode gets executed here and this function is heavily optimized: for (;;) { opcode = next_uop->opcode; oparg = next_uop->oparg; // (...) case UNARY_NOT: { PyObject *value; PyObject *res; value = stack_pointer[-1]; assert(PyBool_Check(value)); res = Py_IsFalse(value) ? Py_True : Py_False; stack_pointer[-1] = res; break; } }
  • 86. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchDynamo ▸ TorchScript can be limiting in some situations. TorchDynamo can overcome some of the limitations while still allowing unmodified Python code to be compiled; ▸ TorchDynamo was introduced as a way to acquire graphs, it uses a feature introduced in CPython 3.6 (PEP 523) where the frame evaluation API was exposed to allow specification of a per-interpreter function pointer to handle the evaluation of frames;
  • 87. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchDynamo void enable_eval_frame_shim(PyThreadState* tstate) { #if PY_VERSION_HEX >= 0x03090000 if (_PyInterpreterState_GetEvalFrameFunc(tstate->interp) != &custom_eval_frame_shim) { DEBUG_CHECK(previous_eval_frame == NULL); previous_eval_frame = _PyInterpreterState_GetEvalFrameFunc(tstate->interp); _PyInterpreterState_SetEvalFrameFunc(tstate->interp, &custom_eval_frame_shim); } #else if (tstate->interp->eval_frame != &custom_eval_frame_shim) { // First call tstate->interp->eval_frame = &custom_eval_frame_shim; } #endif }
  • 88. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchDynamo TorchDynamo behavior. Credit of the diagram to Jason Ansel.
  • 89. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchDynamo ▸ TorchDynamo can switch back to the default Python frame evaluation when it is not able to capture the graph, creating what is called a graph break; ▸ A graph break can be created for many reasons, such as calling external libs such as numpy, or converting tensors to Python types (e.g. Tensor.tolist() , Tensor.item() , etc); ▸ You can get the reason for each graph break, and each graph break obviously has a performance penalty from switching back and forth between compiled code and Python code; ▸ TorchDynamo is used by torch.compile() but it is also exposed in the torch._dynamo module.
  • 90. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchDynamo def my_fn(x): x = x * 2 x = x.tolist() x += [1, 2] return x def custom_backend(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]): gm.graph.print_tabular() return gm.forward opt_my_fn = torch.compile(my_fn, backend=custom_backend) ret = opt_my_fn(torch.tensor([1., 2.])) Note that we are explicitly calling the Tensor.tolist() where Torch will have to convert tensors into a Python list object.
  • 91. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchDynamo Our custom_backend was called just once with the following captured graph: opcode name target args kwargs ------------- ------ ----------------------- --------- -------- placeholder l_x_ L_x_ () {} call_function mul <built-in function mul> (l_x_, 2) {} output output output ((mul,),) {}
  • 92. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchDynamo Our custom_backend was called just once with the following captured graph: opcode name target args kwargs ------------- ------ ----------------------- --------- -------- placeholder l_x_ L_x_ () {} call_function mul <built-in function mul> (l_x_, 2) {} output output output ((mul,),) {} This graph captures only the x = x * 2 part of the code, because of the graph break introduced due to the Tensor.tolist() operation. TorchDynamo then delegates the execution of x += [1, 2] back to Python’s default frame evaluation.
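A convenient way to see these breaks summarized is torch._dynamo.explain; the exact call signature has changed between releases, so treat this as a sketch:

    import torch

    def my_fn(x):          # same function as above
        x = x * 2
        x = x.tolist()     # forces a graph break
        x += [1, 2]
        return x

    explanation = torch._dynamo.explain(my_fn)(torch.tensor([1., 2.]))
    print(explanation)     # number of captured graphs, graph breaks and their reasons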
  • 93. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchDynamo What happens if we modify our my_fn function to go back to a torch tensor and do a torch operation again ?
  • 94. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchDynamo What happens if we modify our my_fn function to go back to a torch tensor and do a torch operation again ? def my_fn(x): x = x * 2 # To Python list x = x.tolist() x += [1, 2] # To torch tensor x = torch.tensor(x) x = x**2 return x
  • 95. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchDynamo opcode name target args ------------- ------ ------------------------ --------- placeholder l_x_ L_x_ () call_function mul <built-in function mul> (l_x_, 2) output output output ((mul,),) opcode name target args ------------- ------ ------------------------- ------------------- call_function tensor <built-in method tensor> ([2.0, 4.0, 1, 2],) call_function pow_1 <built-in function pow> (tensor, 2) output output output ((pow_1,),) Note that our custom_backend was called twice with different graphs representing the first part of computation and the second part of the computation, without the pure-Python operations on the Python list .
  • 96. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchDynamo ▸ So far, we haven’t actually compiled any of the graphs that our custom_backend backend received. We have been focusing only on the graph acquisition problem. ▸ To get performance improvements, we need to equip torch.compile() with a compiler that will convert the acquired graphs into efficient native code for different target hardware such as NVIDIA GPUs, Arm CPUs, RISC-V CPUs, TPUs, exotic edge devices such as your smart toaster, among others.
  • 97. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchDynamo ▸ So far, we haven’t actually compiled any of the graphs that our custom_backend backend received. We have been focusing only on the graph acquisition problem. ▸ To get performance improvements, we need to equip torch.compile() with a compiler that will convert the acquired graphs into efficient native code for different target hardware such as NVIDIA GPUs, Arm CPUs, RISC-V CPUs, TPUs, exotic edge devices such as your smart toaster, among others. That’s where TorchInductor comes into play.
  • 98. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Section IV ; Inductor <
  • 99. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch AOTAutograd ▸ TorchDynamo generates Torch IR, which is a high-level representation that is not suitable for many different compiler backends; ▸ If we want to speed up training, we need to capture the backward pass as well, hence the need for AOTAutograd, where AOT stands for ahead-of-time; ▸ AOTAutograd will generate ATen/Prims IR by tracing the forward and backward graphs ahead of time;
  • 100. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch AOTAutograd ▸ TorchDynamo generates Torch IR, which is a high-level representation that is not suitable for many different compiler backends; ▸ If we want to speed up training, we need to capture the backward pass as well, hence the need for AOTAutograd, where AOT stands for ahead-of-time; ▸ AOTAutograd will generate ATen/Prims IR by tracing the forward and backward graphs ahead of time; ▸ IRs in PyTorch are a complex subject with many levels and many decompositions available; ▸ We will see an example of the difference between the graph generated by TorchDynamo vs the graph generated by AOTAutograd.
  • 101. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch The big picture Slide from “Deep Dive into TorchInductor and PT2 Backend Integration". Sherlock Huang et al.
  • 102. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Dynamo Torch IR Let’s take a look at the IR generated by TorchDynamo for the following model: class MLP(nn.Module): def __init__(self): super().__init__() self.fc1 = nn.Linear(8, 10) def forward(self, x): x = self.fc1(x) x = torch.nn.functional.softmax(x, -1) return x
  • 103. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Dynamo Torch IR Let’s use the print_readable() method to show the graph this time: def custom_backend(gm: torch.fx.GraphModule, example_inputs: list[torch.Tensor]): gm.print_readable() return gm.forward model = MLP() my_fn_opt = torch.compile(model, backend=custom_backend) input_tensor = torch.randn(10, 8) ret = my_fn_opt(input_tensor)
  • 104. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Dynamo Torch IR This will yield the following IR: class GraphModule(torch.nn.Module): def forward(self, L_x_ : torch.Tensor): l_x_ = L_x_ # code: x = self.fc1(x) l__self___fc1 = self.L__self___fc1(l_x_); l_x_ = None # code: x = torch.nn.functional.softmax(x, -1) softmax = torch.nn.functional.softmax(l__self___fc1, -1); l__self___fc1 = None return (softmax,)
  • 105. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch AOTAutograd ATen IR Let’s now change the backend a bit to use AOTAutograd: from torch._functorch.aot_autograd import aot_module_simplified def custom_backend(gm: torch.fx.GraphModule, example_inputs: list[torch.Tensor]): def my_compiler(gm, example_inputs): gm.print_readable() return gm.forward return aot_module_simplified( gm, example_inputs, fw_compiler=my_compiler )
  • 106. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch AOTAutograd ATen IR And here we are with the AOTAutograd generated IR (with = None ’s and some comments removed for brevity): class GraphModule(torch.nn.Module): def forward(self, primals_1: f32[10, 8], primals_2: f32[10], primals_3: f32[10, 8]): # code: x = self.fc1(x) t: f32[8, 10] = torch.ops.aten.t.default(primals_1) addmm: f32[10, 10] = torch.ops.aten.addmm.default(primals_2, primals_3, t) # code: x = torch.nn.functional.softmax(x, -1) _softmax: f32[10, 10] = torch.ops.aten._softmax.default(addmm, -1, False) return [_softmax, primals_3, _softmax]
  • 107. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchInductor Inductor takes the graph produced by AOTAutograd (consisting of ATen/Prim IR) and performs further graph decompositions: def forward(self, arg0_1: f32[10, 8], arg1_1: f32[10], arg2_1: f32[10, 8]): # code: x = self.fc1(x) permute: f32[8, 10] = torch.ops.aten.permute.default(arg0_1, [1, 0]) addmm: f32[1024, 10] = torch.ops.aten.addmm.default(arg1_1, arg2_1, permute); # code: x = torch.nn.functional.softmax(x, -1) amax: f32[10, 1] = torch.ops.aten.amax.default(addmm, [-1], True) sub: f32[10, 10] = torch.ops.aten.sub.Tensor(addmm, amax) exp: f32[10, 10] = torch.ops.aten.exp.default(sub) sum_1: f32[10, 1] = torch.ops.aten.sum.dim_IntList(exp, [-1], True) div: f32[10, 10] = torch.ops.aten.div.Tensor(exp, sum_1) return (div,)
  • 108. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchInductor ▸ After that, the graph goes to the scheduling phase where fusion can happen and then to the appropriate TorchInductor backend; ▸ TorchInductor can generate C++/OpenMP code or Triton. The generated kernels are then called by a generated wrapper; ▸ Industry is collaborating on backend optimizations (e.g. Intel speedups for CPU bfloat16 in some recent processors);
  • 109. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchInductor ▸ After that, the graph goes to the scheduling phase where fusion can happen and then to the appropriate TorchInductor backend; ▸ TorchInductor can generate C++/OpenMP code or Triton. The generated kernels are then called by a generated wrapper; ▸ Industry is collaborating on backend optimizations (e.g. Intel speedups for CPU bfloat16 in some recent processors); ▸ We will now see a part of a C++ kernel generated by TorchInductor for the fused softmax with CPU tensors (on macOS as an example).
  • 110. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchInductor extern "C" void kernel(float* in_out_ptr0, float* out_ptr0, float* out_ptr1) { auto in_ptr0 = in_out_ptr0; { #pragma GCC ivdep for(long i0=static_cast<long>(0L); i0<static_cast<long>(10L); i0+=static_cast<long>(1L)) { float tmp_acc0 = -std::numeric_limits<float>::infinity(); for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L)) { auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))]; tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0); } out_ptr0[static_cast<long>(i0)] = tmp_acc0; } }
  • 111. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch TorchInductor Now, if we run the same code with CUDA tensors, what we will get is the Triton kernel below: @triton.jit def triton_(in_ptr0, out_ptr2, xnumel, rnumel, XBLOCK : tl.constexpr): # ... (omitted for brevity) tmp0 = tl.load(in_ptr0 + (r1 + (10*x0)), rmask & xmask, other=0) tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK]) tmp3 = tl.where(rmask & xmask, tmp1, float("-inf")) tmp4 = triton_helpers.max2(tmp3, 1)[:, None] tmp5 = tmp0 - tmp4 tmp6 = tl.exp(tmp5) tmp7 = tl.broadcast_to(tmp6, [XBLOCK, RBLOCK]) tmp9 = tl.where(rmask & xmask, tmp7, 0) tmp10 = tl.sum(tmp9, 1)[:, None] tmp11 = tmp6 / tmp10 tl.store(out_ptr2 + (r1 + (10*x0)), tmp11, rmask & xmask)
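To reproduce dumps like these, one option (among others, and subject to change between releases) is to enable Inductor's debug output before running the compiled model; a sketch using the MLP from the earlier slides:

    # $ TORCH_COMPILE_DEBUG=1 python run_model.py
    # or: $ TORCH_LOGS="output_code" python run_model.py
    import torch

    model_opt = torch.compile(MLP())           # default backend is TorchInductor
    out = model_opt(torch.randn(10, 8))
    # The generated C++/Triton kernels are dumped under ./torch_compile_debug/
    # (the exact directory layout depends on the PyTorch version).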
  • 112. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Section V ; Torch Export <
  • 113. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export Path ▸ Torch Export ( torch.export ) was created to do whole-graph capture; ▸ As we discussed earlier, TorchDynamo can create graph breaks and do this back-and-forth with the Python interpreter; ▸ This cooperative dynamic with Python makes it difficult to embed it in environments without the Python runtime;
  • 114. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export Path ▸ Torch Export ( torch.export ) was created to do whole-graph capture; ▸ As we discussed earlier, TorchDynamo can create graph breaks and do this back-and-forth with the Python interpreter; ▸ This cooperative dynamic with Python makes it difficult to embed it in environments without the Python runtime; ▸ torch.export relies on the torch.compile stack, but with important differences: it doesn’t fall back to the Python interpreter, so the captured graph cannot have graph breaks and code changes may be required; ▸ The main goal of torch.export is to provide a normalized IR using the Core ATen IR opset that can be loaded and executed in different languages/environments.
  • 115. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Dynamo Torch IR Let’s use the same code we used earlier with TorchDynamo and export it with torch.export : class MLP(nn.Module): def __init__(self): super().__init__() self.fc1 = nn.Linear(8, 10) def forward(self, x): x = self.fc1(x) x = torch.nn.functional.softmax(x, -1) return x
  • 116. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export >>> import torch.export as export >>> model = MLP() >>> sample = torch.randn(10, 8) >>> exp = export.export(model, (sample,)) >>> exp <torch.export.ExportedProgram object at 0x163c8ad10> >>> print(exp)
  • 117. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export >>> import torch.export as export >>> model = MLP() >>> sample = torch.randn(10, 8) >>> exp = export.export(model, (sample,)) >>> exp <torch.export.ExportedProgram object at 0x163c8ad10> >>> print(exp) class GraphModule(torch.nn.Module): def forward(self, arg0_1: f32[10, 8], arg1_1: f32[10], arg2_1: f32[10, 8]): permute: f32[8, 10] = torch.ops.aten.permute.default(arg0_1, [1, 0]) addmm: f32[10, 10] = torch.ops.aten.addmm.default(arg1_1, arg2_1, permute) _softmax: f32[10, 10] = torch.ops.aten._softmax.default(addmm, -1, False) return (_softmax,) (...)
  • 118. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export Let’s serialize the exported graph: >>> export.save(exp, "serialized_graph.pt2")
  • 119. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export Let’s serialize the exported graph: >>> export.save(exp, "serialized_graph.pt2") We can see that the format is a zip archive: $ file serialized_graph.pt2 serialized_graph.pt2: Zip archive data
  • 120. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export Let’s serialize the exported graph: >>> export.save(exp, "serialized_graph.pt2") We can see that the format is a zip archive: $ file serialized_graph.pt2 serialized_graph.pt2: Zip archive data ... and we can extract to inspect: $ unzip serialized_graph.pt2 extracting: serialized_exported_program.json extracting: serialized_state_dict.json extracting: version
  • 121. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export There is a version file: $ cat version 1
  • 122. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export There is a version file: $ cat version 1 A serialized_exported_program.json : $ file serialized_exported_program.json serialized_exported_program.json: JSON data
  • 123. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export There is a version file: $ cat version 1 A serialized_exported_program.json : $ file serialized_exported_program.json serialized_exported_program.json: JSON data And the serialized_state_dict.json : $ file serialized_state_dict.json serialized_state_dict.json: Zip archive data Not sure why PyTorch uses a json extension for a Zip archive.
  • 124. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export $ jq "keys" serialized_exported_program.json ["equality_constraints", "example_inputs", "graph_module", "opset_version", "range_constraints", "schema_version"]
  • 125. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export $ jq "keys" serialized_exported_program.json ["equality_constraints", "example_inputs", "graph_module", "opset_version", "range_constraints", "schema_version"] The graph is in the graph_module and there is a opset_version with the used ATen IR opset version: $ jq .opset_version serialized_exported_program.json { "aten": 10 }
  • 126. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export Let’s see the nodes from the graph: $ jq ".graph_module.graph.nodes[].target" (...) "torch.ops.aten.permute.default" "torch.ops.aten.addmm.default" "torch.ops.aten._softmax.default"
  • 127. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export Let’s see the nodes from the graph: $ jq ".graph_module.graph.nodes[].target" (...) "torch.ops.aten.permute.default" "torch.ops.aten.addmm.default" "torch.ops.aten._softmax.default" Let’s see the outputs of the graph: $ jq .graph_module.graph.outputs (...) [{ "as_none": null, "as_tensor": { "name": "_softmax" }, "as_tensors": null, "as_int": null, "as_ints": null, "..." }]
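The saved program can also be loaded back with torch.export.load and its graph inspected from Python (a sketch; the execution API around ExportedProgram, e.g. calling it directly vs. going through .module(), has been changing between releases):

    import torch.export as export

    loaded = export.load("serialized_graph.pt2")
    loaded.graph_module.print_readable()    # the captured graph with Core ATen ops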
  • 128. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export ▸ You might need to rewrite your code if you use torch.export , especially if you have graph breaks or data/shape-dependent control flow; ▸ torch.export is, nevertheless, a very nice direction towards standardization of the IR. If vendors adopt it, you can skip intermediate representations (e.g. ONNX) and many nightmares; ▸ APIs and IR opsets are very recent and subject to change, so keep an eye on their development;
  • 129. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Torch Export ▸ You might need to rewrite your code if you use torch.export , especially if you have graph breaks or data/shape-dependent control flow; ▸ torch.export is, nevertheless, a very nice direction towards standardization of the IR. If vendors adopt it, you can skip intermediate representations (e.g. ONNX) and many nightmares; ▸ APIs and IR opsets are very recent and subject to change, so keep an eye on their development; ▸ We now have a serialized graph; let’s find out how we can actually execute it outside of Python. That’s where ExecuTorch joins the party !
  • 130. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Section VI ; ExecuTorch <
  • 131. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch ▸ ExecuTorch (ET) leverages the PyTorch 2 compiler and export path to enable on-device execution of PyTorch models; ▸ Portable runtime, low memory footprint, and doesn’t use TorchScript (as in PyTorch mobile); ▸ Still a lot of ongoing development; this talk is aligned with the v0.1.0 branch of ExecuTorch, a preview release for testing and evaluation; ▸ Multiple backends (Arm, Qualcomm, XNNPACK, Apple, etc.) are being developed, allowing ExecuTorch to delegate to DSPs, NPUs, CPUs, etc.; ▸ Hope to see more industry collaboration.
  • 132. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch ExecuTorch has two main phases: AOT (Ahead of Time) This is the program preparation (before the execution). ExecuTorch leverages TorchDynamo and PyTorch export to convert the model into an IR. Optionally, backends can plug in at this phase as well, in what is called backend delegation for AOT.
  • 133. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch ExecuTorch has two main phases: AOT (Ahead of Time) This is the program preparation (before the execution). ExecuTorch leverages TorchDynamo and PyTorch export to convert the model into an IR. Optionally, backends can plug in at this phase as well, in what is called backend delegation for AOT. Runtime The ExecuTorch runtime executes models on edge devices (which can be high-end or very constrained). It will initialize, execute and release resources. It will also initialize delegates and (surprise) delegate execution of the program (or parts of it) to them as well.
  • 134. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Concept Overview Image from ExecuTorch documentation, December 2023.
  • 135. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Lowering ExecuTorch performs progressive lowering of the graph or parts of the graph to different IRs, so the operations get progressively closer to the hardware: ▸ Edge dialect: all operators come from a predefined operator set and inputs/outputs must be tensors
  • 136. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Lowering ExecuTorch performs progressive lowering of the graph or parts of the graph to different IRs, so the operations get progressively closer to the hardware: ▸ Edge dialect: all operators come from a predefined operator set and inputs/outputs must be tensors ▸ Backend dialect: the immediate result of exporting the Edge dialect to a particular backend. Allows the introduction of target-specific operators (that are aware of the hardware they will run on later)
  • 137. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Memory Planning Before serializing the program ( .pte file), ExecuTorch performs memory planning. It uses size and lifespan of mutable tensors to plan their location (offset) in fixed size memory arenas:
  • 138. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Memory Planning Before serializing the program ( .pte file), ExecuTorch performs memory planning. It uses size and lifespan of mutable tensors to plan their location (offset) in fixed size memory arenas: Naive algorithm Concatenates all the tensors together in a linear memory without considering any memory re-use.
  • 139. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Memory Planning Before serializing the program ( .pte file), ExecuTorch performs memory planning. It uses size and lifespan of mutable tensors to plan their location (offset) in fixed size memory arenas: Naive algorithm Concatenates all the tensors together in a linear memory without considering any memory re-use. Greedy algorithm Tries to re-use the already allocated memory and chooses based on a best-fit criterion.
  • 140. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Memory Planning Before serializing the program ( .pte file), ExecuTorch performs memory planning. It uses size and lifespan of mutable tensors to plan their location (offset) in fixed size memory arenas: Naive algorithm Concatenates all the tensors together in a linear memory without considering any memory re-use. Greedy algorithm Tries to re-use the already allocated memory and chooses based on a best-fit criterion. program = edge_program.to_executorch( # Example exir.ExecutorchBackendConfig( memory_planning_pass=MemoryPlanningPass( memory_planning_algo="greedy", # (...) )))
  • 141. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Export Let’s export the same model that we had before: class MLP(nn.Module): def __init__(self): super().__init__() self.fc1 = nn.Linear(8, 10) def forward(self, x): x = self.fc1(x) x = torch.nn.functional.softmax(x, -1) return x
  • 142. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Export from torch._export import capture_pre_autograd_graph from executorch.exir import to_edge model = MLP() model = model.eval() inputs = (torch.randn(10, 8),) pre_atgrad_aten_ir = capture_pre_autograd_graph(model, inputs) aten_ir = export.export(pre_atgrad_aten_ir, inputs) edge_ir = to_edge(aten_ir) program = edge_ir.to_executorch() with open("model.pte", "wb") as fhandle: fhandle.write(program.buffer)
  • 143. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Serialization The serialization of the program uses the same memory efficient format used in TensorFlow Lite: FlatBuffers. The Program schema is defined in the schema/program.fbs file: // (...) omitted for brevity table Program { // Schema version. version:uint; // List of ExecutionPlans that make up the program. // Each ExecutionPlan corresponds with a different // entry point into the model. execution_plan:[ExecutionPlan]; // (...) omitted for brevity }
  • 144. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Serialization Let’s see what our exported program looks like by converting the binary flatbuffer to JSON: $ flatc --strict-json --raw-binary -t executorch/schema/program.fbs -- ./model.pte
  • 145. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Serialization Let’s see what our exported program looks like by converting the binary flatbuffer to JSON: $ flatc --strict-json --raw-binary -t executorch/schema/program.fbs -- ./model.pte $ jq ".execution_plan[0].name" model.json "forward"
  • 146. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Serialization Let’s see what our exported program looks like by converting the binary flatbuffer to JSON: $ flatc --strict-json --raw-binary -t executorch/schema/program.fbs -- ./model.pte $ jq ".execution_plan[0].name" model.json "forward" $ jq ".execution_plan[0].operators[].name" model.json "aten::permute_copy" "aten::addmm" "aten::_softmax"
  • 147. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Memory Planning in Action Let’s see how one tensor looks like in the Program : // (...) "val_type": "Tensor", "val": { "scalar_type": "FLOAT", "sizes": [10, 8], "dim_order": [0, 1], "allocation_info": { "memory_id": 1, "memory_offset": 800 } } // (...)
  • 148. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Memory Planning in Action Constant tensors (e.g. weights in a Linear layer) are handled differently than mutable tensors: Result<void*> getTensorDataPtr(...) { if (s_tensor->constant_buffer_idx() > 0) { auto data = program->get_constant_buffer_data( s_tensor->constant_buffer_idx()); return const_cast<void*>(data.get()); } const executorch_flatbuffer::AllocationDetails* allocation_info = s_tensor->allocation_info(); if (allocation_info != nullptr) { const uint32_t memory_id = allocation_info->memory_id() - 1; return allocator->get_offset_address( memory_id, allocation_info->memory_offset(), nbytes); } // (...) }
  • 149. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Concept Overview Image from ExecuTorch documentation, December 2023.
  • 150. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Runtime The ExecuTorch runtime is a portable runtime: ▸ C++11 compatible, no exceptions or RTTI ▸ cmake and buck2 build support is provided ▸ The memory allocation mechanism is provided by the user; the core runtime doesn’t do memory allocations (backend kernels might, but are discouraged from doing so) ▸ Can have different memory regions for mutable tensors (e.g. SRAM/DRAM placement) ▸ Without kernels or a backend, the runtime is 50 KB
  • 151. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Runtime We now have the exported Program and want to load the model.pte and execute it on the edge. ▸ At this point, your next steps will depend on the edge device you want the runtime to run on; ▸ There are many examples in ExecuTorch on how to deploy using XNNPACK, or targeting ARM (e.g. Ethos-U NPU), Qualcomm Hexagon NPU, DSPs, building Android/iOS apps, etc;
  • 152. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch ExecuTorch Runtime We now have the exported Program and want to load the model.pte and execute it on the edge. ▸ At this point, your next steps will depend on the edge device you want the runtime to run on; ▸ There are many examples in ExecuTorch on how to deploy using XNNPACK, or targeting ARM (e.g. Ethos-U NPU), Qualcomm Hexagon NPU, DSPs, building Android/iOS apps, etc; ▸ For this tutorial, I will target a Pixel Watch 2 device (with a Cortex A53) and use the portable CPU kernels.
  • 153. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Loading the Program Let’s start looking at how we can use the runtime in C++ by first loading the serialized Program : Result<FileDataLoader> loader = FileDataLoader::from(model_path); Result<Program> program = Program::load(&loader.get()); Result<MethodMeta> method_meta = program->method_meta("forward");
  • 154. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Loading the Program Let’s start looking at how we can use the runtime in C++ by first loading the serialized Program : Result<FileDataLoader> loader = FileDataLoader::from(model_path); Result<Program> program = Program::load(&loader.get()); Result<MethodMeta> method_meta = program->method_meta("forward"); ▸ The .pte file is opened ▸ File header is parsed ▸ Flatbuffer is created with serialized data
  • 155. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Memory Affair Let’s now create an allocator method_allocator for the method structure: static uint8_t method_allocator_pool[4 * 1024U * 1024U]; MemoryAllocator method_allocator{ MemoryAllocator(sizeof(method_allocator_pool), method_allocator_pool)};
  • 156. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Memory Affair Let’s now create an allocator method_allocator for the method structure: static uint8_t method_allocator_pool[4 * 1024U * 1024U]; MemoryAllocator method_allocator{ MemoryAllocator(sizeof(method_allocator_pool), method_allocator_pool)}; Most of this code is from executor_runner.cpp in ExecuTorch. Don’t get too attached to the idiosyncrasies, but to what it is actually doing.
  • 157. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Memory Affair Let’s allocate now the planned buffers for the mutable tensors: std::vector<std::unique_ptr<uint8_t[]>> buffers; std::vector<Span<uint8_t>> spans; size_t n_planned_buffers = method_meta->num_memory_planned_buffers(); for (size_t id = 0; id < n_planned_buffers; ++id) { size_t buffer_size = method_meta->memory_planned_buffer_size(id).get(); buffers.push_back(std::make_unique<uint8_t[]>(buffer_size)); spans.push_back({buffers.back().get(), buffer_size}); } HierarchicalAllocator planned_memory({buffers.data(), spans.size()}); MemoryManager memory_manager(&method_allocator, &planned_memory);
  • 158. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Memory Affair We can now finally execute the method: Result<Method> method = program->load_method("forward", &memory_manager); method.set_input(...) // set the method inputs Error status = method->execute(); // Get the outputs into "outputs" std::vector<EValue> outputs(method->outputs_size()); status = method->get_outputs(outputs.data(), outputs.size());
  • 159. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Our victim today ▸ Google Pixel Watch 2 ▸ Qualcomm SW5100, 4x Cortex A53 cores ▸ 2GB of RAM ▸ Android Wear OS 4
  • 160. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Our victim today ▸ Google Pixel Watch 2 ▸ Qualcomm SW5100, 4x Cortex A53 cores ▸ 2GB of RAM ▸ Android Wear OS 4 ▸ I’m not affiliated with Google, this happened to be the first small device in front of me. I’m planning to experiment with a more constrained RP2040 (Raspberry Pi Pico, Cortex-M0+) next time.
  • 161. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Which CPU is that Pixel Watch 2 runs Android, let’s see the architecture: $ uname -a Linux localhost 5.15.104-android13-(...) armv8l Toybox Interestingly this SoC supports 64-bit armv8, but it is running in 32-bit mode with the kernel compiled for armv8l (32-bit, little endian).
  • 162. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Which CPU is that Pixel Watch 2 runs Android, let’s see the architecture: $ uname -a Linux localhost 5.15.104-android13-(...) armv8l Toybox Interestingly this SoC supports 64-bit armv8, but it is running in 32-bit mode with the kernel compiled for armv8l (32-bit, little endian). $ cat /proc/cpuinfo processor : 0 model name : ARMv8 Processor rev 4 (v8l) Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt lpae evtstrm aes pmull sha1 sha2 crc32 CPU implementer : 0x51 CPU architecture: 8 CPU variant : 0xa CPU part : 0x801 CPU revision : 4 (...)
  • 163. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Toolchains Everywhere Let’s prepare to use the Android toolchain for cross-compilation: Download the Android NDK and set its path: $ export ANDROID_NDK=/opt/android-ndk-r26b
  • 164. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Toolchains Everywhere Let’s prepare to use the Android toolchain for cross-compilation: Download the Android NDK and set its path: $ export ANDROID_NDK=/opt/android-ndk-r26b Then we just add some variables into CMakeLists.txt in ExecuTorch: set(CMAKE_SYSTEM_NAME Android) set(CMAKE_SYSTEM_VERSION 24) set(CMAKE_ANDROID_ARCH_ABI armeabi-v7a) I only found the armeabi-v7a architecture available in the Android NDK; since armv8l is backwards compatible with ARMv7, I’m using that one.
  • 165. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Selective Build There are many ways of building our application and linking it to ExecuTorch; we will use selective build, which compiles only a selected subset of kernels, and MobileNetV2 as the model.
  • 166. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Selective Build There are many ways of building our application and linking it to ExecuTorch; we will use selective build, which compiles only a selected subset of kernels, and MobileNetV2 as the model. Luckily, ExecuTorch has some scripts to help with exporting the model and compiling. Let’s export MobileNetV2 ( mv2 ): $ python3 -m examples.portable.scripts.export --model_name="mv2" This will create the serialized program mv2.pte .
  • 167. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Selective Build There are many ways of building our application and linking it to ExecuTorch; we will use selective build, which compiles only a selected subset of kernels, and MobileNetV2 as the model. Luckily, ExecuTorch has some scripts to help with exporting the model and compiling. Let’s export MobileNetV2 ( mv2 ): $ python3 -m examples.portable.scripts.export --model_name="mv2" This will create the serialized program mv2.pte . Now we can compile it with cmake : $ examples/selective_build/test_selective_build.sh cmake
  • 168. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Selective Build You can look at the test_selective_build.sh but the important bit here is the selected ops list we are building in our application: $ cmake (...) -DEXECUTORCH_SELECT_OPS_LIST="aten::convolution.out, (...) aten::mean.out,aten::view_copy.out,aten::permute_copy.out, aten::addmm.out,aten,aten::clone.out" Instead of building all kernels, we are selecting only a few of them. This is very important for more constrained devices.
  • 169. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Selective Build You can look at the test_selective_build.sh but the important bit here is the selected ops list we are building in our application: $ cmake (...) -DEXECUTORCH_SELECT_OPS_LIST="aten::convolution.out, (...) aten::mean.out,aten::view_copy.out,aten::permute_copy.out, aten::addmm.out,aten,aten::clone.out" Instead of building all kernels, we are selecting only a few of them. This is very important for more constrained devices. We just copy our binary model_app and the exported model mv2.pte to the Pixel Watch 2 using Android adb tool and then run the model: $ model_app --model_path="mv2.pte"
  • 170. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Selective Build The output of executing the example app in the Pixel Watch 2 will be something like this: Output 0: tensor(sizes=[1, 1000], [ -0.50986, 0.300638, 0.0953863, 0.147721, 0.231201, 0.338555, 0.20689, -0.0575741, -0.389267, -0.0606858, -0.0213996, -0.121034, -0.288955, 0.134052, -0.171977, -0.060362, 0.0203591, -0.0585306, 0.337859, -0.0718654, 0.490758, 0.524143, 0.197859, 0.122067, -0.35913, 0.10946, 0.347745, 0.478512, 0.226557, 0.0363519, (...) Showing the 1000 class logits for the input (all 1’s in our case).
  • 171. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Thanks ! I hope you enjoyed this presentation ! This was an overview of the internals of some of the projects in the PyTorch ecosystem that came out recently. I skipped some other important aspects such as distributed training, but hopefully it will come soon in the next iteration of this presentation. Huge thanks to all PyTorch contributors !
  • 172. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Section VII ; Q&A <
  • 173. PyTorch 2 internals - Christian S. Perone (2023) Tensors JIT Dynamo Inductor Torch Export ExecuTorch Q&A Thanks !