Toward Hybrid Cloud Serverless Transparency with Lithops Framework
1. Toward Hybrid Cloud Serverless Transparency with Lithops Framework
Gil Vernik, IBM Research
gilv@il.ibm.com
2. About myself
• Gil Vernik
• IBM Research from 2010
• Architect, 25+ years of development experience
• Active in open source
• Hybrid cloud, Big Data engines, serverless
• Twitter: @vernikgil
• https://www.linkedin.com/in/gil-vernik-1a50a316/
3. All material and code presented in this talk are open source.
Comments, suggestions, and code contributions are welcome.
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825184.
Photos used in this presentation by Unknown Author are licensed under CC BY-SA-NC.
8. Serverless paradigm
Serverless doesn't mean that there are no servers; it means that we don't need to worry about servers.
In the serverless world, we simply "deliver" our code, software stack, or workload.
The "serverless backend engine" will take care of provisioning the server(s) and executing the code.
9. This way, we don't think about servers at all, hence the name "serverless".
(Photo by Unknown Author, licensed under CC BY-NC-ND.)
11. Serverless user experience
User code with software stack and dependencies → IBM Cloud Functions
The more we focus on the business logic, and the less on how to deploy and execute, the better our "serverless" experience.
13. Code, dependencies and containers
Docker image with dependencies: dependencies, packages, software, etc.
Should the code be part of the Docker image?
14. The gap between the business logic and the boilerplate code
Example: a user wants to run ML algorithms on the colors extracted from images. He wrote a function that extracts the colors from a single image and tested that it works. He now wants to run this function on millions of images, located in different storage places (cloud object storage, local Ceph, etc.), extract all the colors, and inject them into an ML framework for further processing.
15. Bring the code to the data, or move the data to the code
• How to run the code as close as possible to the data? Local? Cloud? Hybrid? How to collect the results?
• Move as little data as possible
• The boilerplate code to access storage
(Example repeated from slide 14.)
16. The gap between the business logic and the boilerplate code
• How to partition the input data?
• How to list millions of images?
• How much memory is needed to process a single image, assuming images of different sizes?
• How to deploy the code with its dependencies?
• And so on…
(Example repeated from slide 14.)
17. Know the APIs and the semantics
Developers need to know vendor documentation and APIs, use CLI tools, and learn how to deploy code and dependencies, how to retrieve results, etc.
Each cloud vendor has its own API and semantics.
18. The containerized model is not only about running and deploying the code
• How to "containerize" code or scale software components from an existing application without major disruption and without rewriting everything from scratch?
• How to scale the code and decide the right parallelism on terabytes of data, without becoming a systems expert in scaling code and learning storage semantics?
• How to partition input data, generate output, and leverage a cache if needed?
(Diagram: user software stacks on top of storage such as COS, Ceph, databases, in-memory cache, etc.)
20. Push to the serverless with the Lithops framework
• Lithops is a novel Python framework designed to scale code or applications at massive scale, exploiting almost any execution backend platform: hybrid clouds, public clouds, etc.
• A single Lithops API against any backend engine
• Open source, http://lithops.cloud , Apache License 2.0
• Led by IBM Research Haifa and Universitat Rovira i Virgili (URV)
• Can benefit a variety of use cases
Serverless for more use cases. The easy move to serverless. (A minimal sketch of the single-API idea follows.)
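As a minimal sketch of the single-API idea (not from the slides; the backend names follow the public Lithops documentation and the 'double' function is invented for illustration), the same map call can run locally for testing and then against a serverless cloud backend, changing only the executor:

import lithops

def double(x):
    # Any user business logic goes here.
    return x * 2

# Local execution (processes on the developer machine), handy for testing.
local_exec = lithops.LocalhostExecutor()
local_exec.map(double, [1, 2, 3, 4])
print(local_exec.get_result())  # [2, 4, 6, 8]

# The same code against a serverless backend; 'ibm_cf' is the IBM Cloud
# Functions backend name in the Lithops docs (assumes credentials configured).
cloud_exec = lithops.FunctionExecutor(backend='ibm_cf')
cloud_exec.map(double, [1, 2, 3, 4])
print(cloud_exec.get_result())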
21. Lithops to scale Python code and applications
input_data = array, COS, etc.

import lithops

def my_func(x):
    # business logic
    ...

lt = lithops.FunctionExecutor()
lt.map(my_func, input_data)
print(lt.get_result())

(Lithops dispatches the invocations to the backend, e.g. IBM Cloud Functions.)
26. More on Lithops
• Truly serverless, lightweight framework; no need to deploy an additional cluster on top of the serverless engine
• Scales from 0 to many
• Can deploy code to any compute backend and simplifies hybrid use cases with a single Lithops API
• Data driven, with an advanced data partitioner to support processing of large input datasets
• Lithops can execute any native code; it is not limited to Python code
• Hides the complexity of sharing data between compute stages and supports shuffle
• Fits well into workflow orchestrator frameworks
A sketch of the data partitioner in action follows.
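As a hedged illustration of the data partitioner (the pattern follows the public Lithops documentation; the bucket name, prefix, and chunk size here are invented), a map function can receive storage objects directly, with Lithops listing the objects and partitioning large ones behind the scenes:

import lithops

def count_lines(obj):
    # 'obj' is injected by Lithops when mapping over storage data;
    # obj.data_stream exposes the (possibly chunked) object contents.
    data = obj.data_stream.read()
    return len(data.splitlines())

lt = lithops.FunctionExecutor()
# 'cos://my-bucket/logs/' is a hypothetical bucket/prefix; obj_chunk_size
# asks the partitioner to split large objects into ~64 MB chunks.
lt.map(count_lines, 'cos://my-bucket/logs/', obj_chunk_size=64 * 1024 * 1024)
print(sum(lt.get_result()))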
27. Data driven flows with Lithops
Lithops leverages the user's business logic to simplify the containerization process:
• Decides the right scale, coordinates parallel invocations, etc.
• The Lithops runtime handles all access to the datasets, partitions the data, uses a cache if needed, and monitors progress
• The Lithops runtime allows serverless invocations to exchange data and implements shuffle
• The Lithops client automatically deploys the user's software stack as serverless actions
(Diagram: user code or application on top of Lithops runtimes running the user software stack, backed by COS, Ceph, databases, in-memory cache, etc.)
28. What is Lithops good for?
• Data pre-processing for ML / DL / AI frameworks
• Batch processing, UDFs, ETL, HPC, and Monte Carlo simulations
• Embarrassingly parallel workloads or problems: often the case where there is little or no dependency or need for communication between parallel tasks
• A subset of map-reduce flows (see the sketch below)
(Diagram: input data split across tasks 1…n, which produce the results.)
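A minimal sketch of such a map-reduce flow (the functions and data are invented for illustration; map_reduce is part of the public Lithops API, though argument details can vary by version):

import lithops

def square(x):
    # Map phase: embarrassingly parallel, no communication between tasks.
    return x * x

def total(results):
    # Reduce phase: runs once over the collected map results.
    return sum(results)

lt = lithops.FunctionExecutor()
lt.map_reduce(square, range(100), total)
print(lt.get_result())  # 328350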
31. Serverless data pre-processing
• The majority of ML / DL / AI flows require raw data to be pre-processed before being consumed by the execution frameworks
• Examples:
  • Images persisted in object storage: the user wants to extract colors and run DL algorithms on the extracted colors
  • Face alignment in images is usually required as a first step before further analysis
(Diagram: MBs of raw images reduced to KBs of extracted data.)
32. Face alignment in images without Lithops

import logging
import os
import sys
import time
import shutil
import cv2
from openface.align_dlib import AlignDlib

logger = logging.getLogger(__name__)
temp_dir = '/tmp'

def preprocess_image(bucket, key, data_stream, storage_handler):
    """
    Detect face, align and crop :param input_path. Write output to :param output_path
    :param bucket: COS bucket
    :param key: COS key (object name) - may contain delimiters
    :param storage_handler: can be used to read / write data from / into COS
    """
    crop_dim = 180
    sys.stdout.write(".")
    # key is of the form /subdir1/../subdirN/file_name
    file_name = key.split('/')[-1]
    input_path = temp_dir + '/' + file_name
    if not os.path.exists(temp_dir + '/output'):
        os.makedirs(temp_dir + '/output')
    output_path = temp_dir + '/output/' + file_name
    with open(input_path, 'wb') as localfile:
        shutil.copyfileobj(data_stream, localfile)
    # Download the face-landmarks model once and cache it locally
    if not os.path.isfile(temp_dir + '/shape_predictor_68_face_landmarks'):
        res = storage_handler.get_object(bucket, 'lfw/model/shape_predictor_68_face_landmarks.dat', stream=True)
        with open(temp_dir + '/shape_predictor_68_face_landmarks', 'wb') as localfile:
            shutil.copyfileobj(res, localfile)
    align_dlib = AlignDlib(temp_dir + '/shape_predictor_68_face_landmarks')
    image = _process_image(input_path, crop_dim, align_dlib)
    if image is not None:
        cv2.imwrite(output_path, image)
        with open(output_path, 'rb') as f:
            storage_handler.put_object(bucket, os.path.join('output', key), f)
        os.remove(output_path)
    os.remove(input_path)

def _process_image(filename, crop_dim, align_dlib):
    image = _buffer_image(filename)
    if image is None:
        raise IOError('Error buffering image: {}'.format(filename))
    return _align_image(image, crop_dim, align_dlib)

def _buffer_image(filename):
    logger.debug('Reading image: {}'.format(filename))
    image = cv2.imread(filename)
    return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

def _align_image(image, crop_dim, align_dlib):
    bb = align_dlib.getLargestFaceBoundingBox(image)
    aligned = align_dlib.align(crop_dim, image, bb, landmarkIndices=AlignDlib.INNER_EYES_AND_BOTTOM_LIP)
    if aligned is not None:
        aligned = cv2.cvtColor(aligned, cv2.COLOR_BGR2RGB)
    return aligned
import ibm_boto3
import ibm_botocore
from ibm_botocore.credentials import DefaultTokenManager

t0 = time.time()
# 'config' is assumed to be loaded elsewhere with the IBM COS credentials
client_config = ibm_botocore.client.Config(signature_version='oauth',
                                           max_pool_connections=200)
api_key = config['ibm_cos']['api_key']
token_manager = DefaultTokenManager(api_key_id=api_key)
cos_client = ibm_boto3.client('s3', token_manager=token_manager,
                              config=client_config,
                              endpoint_url=config['ibm_cos']['endpoint'])

try:
    paginator = cos_client.get_paginator('list_objects_v2')
    page_iterator = paginator.paginate(Bucket='gilvdata', Prefix='lfw/test/images')
    print(page_iterator)
except ibm_botocore.exceptions.ClientError as e:
    print(e)

class StorageHandler:

    def __init__(self, cos_client):
        self.cos_client = cos_client

    def get_object(self, bucket_name, key, stream=False, extra_get_args={}):
        """
        Get object from COS with a key. Throws StorageNoSuchKeyError if the given key does not exist.
        :param key: key of the object
        :return: Data of the object
        :rtype: str/bytes
        """
        try:
            r = self.cos_client.get_object(Bucket=bucket_name, Key=key, **extra_get_args)
            if stream:
                data = r['Body']
            else:
                data = r['Body'].read()
            return data
        except ibm_botocore.exceptions.ClientError as e:
            if e.response['Error']['Code'] == "NoSuchKey":
                raise StorageNoSuchKeyError(key)
            else:
                raise e

    def put_object(self, bucket_name, key, data):
        """
        Put an object in COS. Override the object if the key already exists.
        :param key: key of the object.
        :param data: data of the object
        :type data: str/bytes
        :return: None
        """
        try:
            res = self.cos_client.put_object(Bucket=bucket_name, Key=key, Body=data)
            status = 'OK' if res['ResponseMetadata']['HTTPStatusCode'] == 200 else 'Error'
            try:
                logger.debug('PUT Object {} size {} {}'.format(key, len(data), status))
            except TypeError:
                # 'data' may be a stream without a len()
                logger.debug('PUT Object {} {}'.format(key, status))
        except ibm_botocore.exceptions.ClientError as e:
            if e.response['Error']['Code'] == "NoSuchKey":
                raise StorageNoSuchKeyError(key)
            else:
                raise e

temp_dir = '/home/dsxuser/.tmp'
storage_client = StorageHandler(cos_client)
for page in page_iterator:
    if 'Contents' in page:
        for item in page['Contents']:
            key = item['Key']
            r = cos_client.get_object(Bucket='gilvdata', Key=key)
            preprocess_image('gilvdata', key, r['Body'], storage_client)

t1 = time.time()
print("Execution completed in {} seconds".format(t1 - t0))
Business logic vs. boilerplate:
• Loop over all images
• Close to 100 lines of "boilerplate" code to find the images, read and write the objects, etc.
• The data scientist needs to be familiar with the S3 API
• Execution time: approximately 36 minutes for 1,000 images!
33. Face alignment in images with Lithops
(The business logic, preprocess_image and its helper functions, is identical to the code on slide 32 and is omitted here.)
import lithops

lt = lithops.FunctionExecutor()
bucket_name = 'gilvdata/lfw/test/images'
results = lt.map_reduce(preprocess_image, bucket_name, None, None).get_result()
• Under 3 lines of "boilerplate"!
• The data scientist does not need to use the S3 API!
• Execution time is 35 seconds
• 35 seconds, compared to 36 minutes!
34. Demo – Color Identification of Images
• Our demo is based on the blog "Color Identification in Images" by Karan Bhanot
• We show how existing code from the blog can be executed at massive scale by Lithops against any compute backend, without modifications to the original code
• Images of flowers are stored in object storage
• The user wants to retrieve all images that contain a specific color
• We demonstrate Lithops with a backend based on the Kubernetes (K8s) API
36. Behind the scenes
• Lithops inspects the input dataset in the object storage
• Generates a DAG of the execution, mapping a single task to process a single image
• Lithops serializes the user-provided code, the execution DAG, and other internal metadata, and uploads them all to object storage
• Lithops generates a ConfigMap and a Job Definition in Code Engine, based on the provided Docker image (or uses a default if none is provided)
• Lithops submits an array job that is mapped to the generated execution DAG
• Each task contains the Lithops runtime and pulls the relevant entry from the execution DAG
• Once a task completes, its status and results are persisted in object storage
• When the array job completes, Lithops reads the results from object storage
A hedged configuration sketch for such a backend follows.
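As an illustrative sketch only (every value is a placeholder; the section and key names follow the public Lithops configuration docs and may differ by version), selecting the Code Engine backend with IBM COS storage can be done by passing a config dict to the executor:

import lithops

# Hypothetical configuration; all values below are placeholders.
config = {
    'lithops': {'backend': 'code_engine', 'storage': 'ibm_cos'},
    'code_engine': {
        'namespace': '<CODE_ENGINE_PROJECT_NAMESPACE>',
        'region': 'us-south',
    },
    'ibm_cos': {
        'endpoint': 'https://s3.us-south.cloud-object-storage.appdomain.cloud',
        'api_key': '<API_KEY>',
        'storage_bucket': '<BUCKET>',
    },
}

lt = lithops.FunctionExecutor(config=config)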
37. The user experience
Today's state: the user is responsible for deployment, configuration, and management (Kubernetes deployment definitions, job descriptions, coordination, and the user application itself).

With the Lithops framework, the user focuses only on the business/science logic, and deployment is completely abstracted:

input_data = array, COS, storage, DBs, etc.

import lithops as lt

def my_func(params):
    # user software stack
    ...

f = lt.FunctionExecutor(backend="code_engine")
f.map(my_func, input_data)
results = f.get_result()
39. Spatial metabolomics and the Big Data challenge
• Spatial metabolomics, the detection of metabolites in cells, tissue, and organs, is the current frontier in understanding our health and disease, in particular in cancer and immunity.
• The process generates a lot of data, since every pixel in a medical image can be considered a sample containing thousands of molecules, and the number of pixels can reach a million; this puts as-high-as-ever requirements on the algorithms for the data analysis.
• EMBL develops novel computational biology tools to reveal the spatial organization of metabolic processes.
40. METASPACE workflow with Lithops
Flow: the scientist uploads a dataset (1 GB-1 TB) → provides metadata, selects parameters, chooses a molecular DB → data preparation for parallel analytics → screening for molecules → molecular visualization, analysis, and sharing.
• Completely serverless, de-centralized architecture
• Optimal scale automatically determined at run time
• Backend selection optimized for cost/performance
https://github.com/metaspace2020/Lithops-METASPACE
In the demo, Lithops uses the Apache OpenWhisk API (see the snippet below).
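In code, the demo's backend selection would look roughly like this (hedged; 'openwhisk' is the backend name used in the Lithops docs, and credentials are assumed to be configured):

import lithops

lt = lithops.FunctionExecutor(backend='openwhisk')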
42. Monte Carlo and Lithops
• Very popular in the financial sector
• Risk and uncertainty analysis
• Molecular biology
• Sport, gaming, gambling
• Weather models, and many more…
Lithops is a natural fit for scaling Monte Carlo computations across a FaaS platform: the user only needs to write the business logic and Lithops does the rest (a minimal sketch follows).
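A minimal sketch of the pattern (invented for illustration, not one of the talk's demos): estimating π with many independent Monte Carlo samplers fanned out by Lithops.

import random
import lithops

def sample(n):
    # One independent Monte Carlo experiment: count random points
    # that fall inside the unit quarter-circle.
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

lt = lithops.FunctionExecutor()
lt.map(sample, [1_000_000] * 100)  # 100 parallel invocations
pi = 4 * sum(lt.get_result()) / (100 * 1_000_000)
print(pi)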
45. Protein Folding
• Proteins are biological polymers that carry out most of the cell's day-to-day functions.
• Protein structure leads to protein function.
• Proteins are made from a linear chain of amino acids and folded into a variety of 3-D shapes.
• Protein folding is a complex process that is not yet completely understood.
46. Replica exchange
• Monte Carlo simulations are popular methods to predict protein folding
• ProtoMol is a framework specially designed for molecular dynamics: http://protomol.sourceforge.net
• A highly parallel replica exchange molecular dynamics (REMD) method is used with a Monte Carlo exchange process for efficient sampling
• A series of tasks (replicas) are run in parallel at various temperatures
• From time to time the configurations of neighboring tasks are exchanged
• Various HPC frameworks allow running protein folding, but they depend on MPI and on VMs or dedicated HPC machines
47. Protein folding with Lithops
Flow: Lithops submits a job of X invocations, each running the ProtoMol (or GROMACS) library to run Monte Carlo simulations. Lithops collects the results of all invocations, and the REMD algorithm uses the output of the invocations as the input to the next job.
Our experiment: 99 jobs
• Each job executes many invocations
• Each invocation runs 100 Monte Carlo steps
• Each step runs 10,000 Molecular Dynamics steps
• REMD exchanges the results of the completed job, which are used as input to the following job
• Our approach doesn't use MPI
A hedged sketch of this job loop appears after the reference below.
"Bringing scaling transparency to Proteomics applications with serverless computing", WoSC'20: Proceedings of the 2020 Sixth International Workshop on Serverless Computing, December 2020, pages 55-60. https://doi.org/10.1145/3429880.3430101
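A hypothetical sketch of that job loop (run_replica and remd_exchange are invented stand-ins for the ProtoMol/GROMACS run and the REMD exchange step; the real implementation is described in the paper above):

import random
import lithops

NUM_JOBS = 99
NUM_REPLICAS = 32  # hypothetical replica count

def run_replica(energy):
    # Stand-in for ProtoMol/GROMACS running 100 Monte Carlo steps,
    # each with 10,000 Molecular Dynamics steps; here we just perturb
    # a scalar 'energy' so the sketch stays runnable.
    return energy + random.gauss(0.0, 1.0)

def remd_exchange(results):
    # Stand-in REMD exchange: swap neighboring replica states (the real
    # criterion is a Metropolis test on energies and temperatures).
    for i in range(0, len(results) - 1, 2):
        results[i], results[i + 1] = results[i + 1], results[i]
    return results

states = [0.0] * NUM_REPLICAS
lt = lithops.FunctionExecutor()
for _ in range(NUM_JOBS):
    lt.map(run_replica, states)              # one "job" of parallel invocations
    states = remd_exchange(lt.get_result())  # completed job feeds the next one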
48. Summary
• Serverless computing provides virtually unlimited resources and is a very attractive compute platform
• The move to serverless may be challenging for certain scenarios
• Lithops is an open source framework, designed for a "push to the cloud" experience
• We saw demos and use cases
• All material and code presented in this talk are open source
• For more details, visit http://lithops.cloud
Thank you
Gil Vernik
gilv@il.ibm.com