15. End-to-End Deep Learning on Unstructured Data
[Diagram: Video Training Set (X10) -> Extract face storlet -> Extracted Training Set -> Train model storlet -> model; Test set -> Recognize face storlet]
16. Demo Setup: S2AIO with Jupyter Notebook
[Diagram: Swift and Storlets all in one]
17. Local Scripts & S3 Vs. S2AIO
[Diagram: Swift and Storlets all in one, compared with S3 accessed by an S3 client with OpenCV and SKLearn]
18. Local Scripts & S3 Vs. S2AIO
Dedicated M4.2XLarge (8 CPUs 32GB RAM)
19. S2AIO on EC2 Vs. EC2/S3
Dedicated M4.2XLarge (8 vCPUs, 32GB RAM, High Network Performance)
[Bar chart: time in seconds (axis 0 to 70) for the Extract, Train, and Recognize steps; series: EC2 Swift & Storlets vs. EC2 & S3]
23. Thank You!
All Demo Code: https://github.com/eranr/e2emlstorlets
My Blog: http://itsonlyme.name/blog
Editor's Notes
Storlets are about co-locating storage and compute. That is, instead of bringing the data to the compute, bring the compute, which is much smaller, to the data.
The Stork is the Storlets project mascot
More specifically, storlets make it possible to co-locate Dockerized computations inside OpenStack Swift in a serverless fashion.
Swift is a massively scalable storage system with a simple API for storing and retrieving data blobs; it takes care of data redundancy, e.g. via replication across failure domains.
We use Docker to run the compute near the data in a secured and isolated manner.
By serverless we mean that an end user uploads the program to Swift just like any other data blob, and we take care of the rest.
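To make the "upload it like any other blob" idea concrete, here is a minimal sketch of the two requests involved: one PUT that stores the storlet code, and one GET that asks Swift to run it on an object. It only builds URLs and headers and sends nothing. The header names follow my reading of the Storlets documentation and should be verified against the deployed version; the storage URL, container names, object names, and the class name `extract_face.ExtractFace` are hypothetical.

```python
from urllib.parse import quote


def object_url(storage_url, container, obj):
    """Build a Swift object URL: <storage_url>/<container>/<object>."""
    return "/".join([storage_url.rstrip("/"), quote(container), quote(obj)])


def storlet_upload_headers(main_class, language="Python", dependencies=()):
    """Metadata sent with the PUT that stores the storlet code.

    Assumption: these header names match the Storlets docs for the
    version in use; check them before relying on this.
    """
    return {
        "X-Object-Meta-Storlet-Language": language,
        "X-Object-Meta-Storlet-Interface-Version": "1.0",
        "X-Object-Meta-Storlet-Main": main_class,
        "X-Object-Meta-Storlet-Dependency": ",".join(dependencies),
        "X-Object-Meta-Storlet-Object-Metadata": "no",
    }


def storlet_invoke_headers(storlet_name, token):
    """Headers for a GET/PUT that should pass through the storlet."""
    return {"X-Auth-Token": token, "X-Run-Storlet": storlet_name}


if __name__ == "__main__":
    # Hypothetical endpoint and names, for illustration only.
    base = "http://swift.example.com:8080/v1/AUTH_demo"
    print("PUT", object_url(base, "storlet", "extract_face.py"))
    print(storlet_upload_headers("extract_face.ExtractFace"))
    print("GET", object_url(base, "video", "clip1.mp4"))
    print(storlet_invoke_headers("extract_face.py", token="<token>"))
```

With any HTTP client, the first request stores the code and the second returns the storlet's output instead of the raw object, so only the (smaller) result crosses the network.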
This is what I refer to as data-centric hyperconvergence. As with traditional hyperconvergence, the idea is to have a storage, compute, and networking solution that can scale horizontally.
Traditional hyperconvergence, though, is focused on general-purpose virtual environments and often goes hand in hand with high-end flash arrays. It is being marketed as a solution for big data analytics over semi-structured data. Here we are focusing on unstructured data, which is the majority of the data.
Hyperconvergence and data-centric hyperconvergence are complementary technologies: one can think of the data-centric part as 'transforming' the unstructured data into semi-structured data that can be consumed with traditional big data machinery.
As such, I think that data-centric hyperconvergence should also have a data management component in the mix, e.g. metadata search.
The graph shows the growth factor of a single SSD/HDD and of a single networking port.
In Ethernet we see growth from 10Gb in 2010 to 100Gb in 2014. Today we are starting to see 200Gb.
InfiniBand started at 56Gb in 2011 and, like Ethernet, was at 100Gb in 2014 and is now at 200Gb.
HDDs were not growing as fast, with the X8 factor due to helium-filled HDDs.
In SSDs, however, we see really big growth, with Seagate announcing a 60TB drive last year and Toshiba a 100TB drive to come out this year.
Now, consider that in a typical storage server there are many more disks than network ports…
Considering a server with 16 disks and 4 network ports, we see a much bigger difference in the growth factor.
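One way to make the disk-versus-port mismatch concrete is to ask how long it would take to move a full server's worth of data over its network ports. The sketch below uses only numbers mentioned in these notes (16 disks, 4 ports, 60TB and 100TB SSDs, 100Gb and 200Gb ports). It assumes decimal units, ports running at full line rate, and no protocol overhead, so real transfers would be slower.

```python
def drain_seconds(disks, tb_per_disk, ports, gbit_per_port):
    """Seconds needed to push the server's full capacity through its ports.

    Assumes decimal units (1 TB = 10**12 bytes, 1 Gb = 10**9 bits) and
    ideal line-rate transfer with no protocol overhead.
    """
    total_bits = disks * tb_per_disk * 10**12 * 8
    bits_per_second = ports * gbit_per_port * 10**9
    return total_bits / bits_per_second


if __name__ == "__main__":
    # (SSD size in TB, port speed in Gb) pairs taken from the notes above.
    for tb, gbit in [(60, 100), (60, 200), (100, 200)]:
        hours = drain_seconds(16, tb, 4, gbit) / 3600
        print(f"16 x {tb}TB over 4 x {gbit}Gb: {hours:.1f} hours")
```

Even under these ideal assumptions the transfer takes hours, which is the argument for moving the computation to the data rather than the data to the computation.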