A walk-through of Joyent's Manta platform on SmartOS that explains how the illumos innovations of zones and ZFS, together with Node.js, led to the development of the Manta Object Store. Examples, the primary Manta commands and simple use cases are provided so you can start using Manta to analyze Big Data with arbitrary Unix/POSIX code, without moving the data.
2. Big Data in 2002 – NBLAST - Computing 361,249,575,000
Protein Sequence Alignments & storing significant hits
http://www.biomedcentral.com/content/pdf/1471-2105-3-13.pdf
3. Big Data in 2003 – Distributed Computing, Tiered Architecture
for 10 Billion Protein 3D structure samples
Volunteer Computing
Blueprint Data Center
4. What is Manta?
• Manta is a new operating-system-level component
of Joyent's IaaS platform, released June 26, 2013.
http://www.joyent.com
• Manta is an object store system for big-data that you can
compute on without moving your data
• Manta provides map-reduce capability for executing POSIX
standard, arbitrary compute jobs directly on cloud storage
servers
• Manta allows map-reduce operations
formed by any standard UNIX command or application
in any run-time language
without moving stored data
without Hadoop or Java code
without loading raw data into a database
5. What Operating System?
• Manta is built on SmartOS, using the illumos kernel, which
is open-source UNIX
• SmartOS is not GNU/Linux
• SmartOS is a very lightweight illumos
distro for cloud hypervisors with KVM and storage
that runs in RAM from PXE/CD/USB boot media
• Derived from Sun Microsystems' OpenSolaris
• Over 10,000 packages supported via pkgsrc system
6. illumos is the Open Source Unix kernel forked from Solaris
[Timeline diagram]
• 1992: UNIX System V Release 4
• 2004-2008: Four years of legal work to open-source Solaris
• Timeline markers: Jan 2010, Aug 2010 (Oracle closed its Solaris source…)
• Kernel components (CDDL): DTrace, Crossbow, Zones, ZFS, SMF, MDB
• Kernel innovations: bugfixes, GCC build, ZFS feature flags, ZFS background delete, ZFS LZ4 compression, KVM Type 1 hypervisor
• Built on the kernel: Cloud OS, Server OS, Storage OS, Database OS and more…
7. Manta – What is SmartOS?
• SmartOS is Joyent’s lightweight, illumos-kernel-based operating system
optimized for high-performance cloud computing.
• illumos is an open-source fork of OpenSolaris, supported by Joyent,
Nexenta, OmniTI, DEY Systems, Delphix and other core committers.
• After Oracle bought Sun Microsystems, many Solaris software engineers,
including those who built ZFS, DTrace and other components, left Oracle and joined
the illumos effort.
• illumos distros that you can experiment with include SmartOS, OmniOS,
OpenIndiana, and NexentaStor.
• Prerequisite for Manta use:
Your code needs to run, and be tested, on (x86) illumos!
8. • Started in 2004
• IaaS hosting:
– Windows, Linux, FreeBSD KVM images
– LinkedIn, Wanelo, Voxer, Storify, Geeklist,
Tripshare … and many others
– Singapore’s Reebonz (reebonz.com.sg)
• 4 Primary Data Centers
• 3rd Party Smart Data Center Licensees
who run Joyent-Powered Clouds, e.g.:
– Telefonica – Spain
– http://cloud.telefonica.com/instantservers/
– MiCloud – Taiwan
– http://micloud.tw/ch/
– Libero – Italy
– http://cloud.libero.it/it/il_nostro_cloud/profilo/
http://www.joyent.com/products/compute-service/data-centers
• Class-1 DC Operators
• SSAE 16 Certified
• Multi-layered Physical Security
• Highly-Redundant Power
• Early Warning Fire Suppression
• All Tier-1 ISP Connectivity
• 10 Gb/40 Gb Fully-Meshed Network
• Full Peering, Fiber Connectivity
May 20, 2013 – Dell drops its OpenStack cloud and
partners with Joyent for high-performance,
high-availability IaaS service provision.
9. Joyent as an IaaS provider
• Has full development control of the entire operating
system stack
• Is the corporate steward of the
Node.js JavaScript runtime
• Community friendly - provides SmartOS image
downloads, source for free, and support
• You can deploy a private cloud for free with the 3rd party
management software “Project FiFo”
10. SmartOS Storage Implementation
• All SmartOS storage is local,
on ZFS
– Integrated disk/volume management
– Copy-on-write
– Self-healing
– Protection against silent data corruption
– No hardware RAID dependency
– Striping, RAID-Z with no write hole
– No fsck; resilvering instead
– Built-in filesystem compression options
– Compress a subdirectory
– Snapshots
– zfs send / receive
– Integrated SSD IO caching
– Add drives with one command, while in
production
• Manta in the Joyent Datacenter
is built on ZFS
– no SAN, no NAS head nodes
– no tiered layers
– standard commodity Intel servers
– 4U servers with 73 TiB of user data
– basic SAS HBA technology
– Every object is stored on 2 ZFS pools by
policy default, local to the server on
which it is accessed
– Architecture leads me to speculate that
Manta stands for
“Manta is Not Tiered Architecture”
11. Manta Features
• A multi-datacenter object store
• Fine-grained replication commands
• No object size limits
• Per-object replication policies
• Filesystem-like namespace including directory
queries
• Up to 1 million files per directory
• Public folders for CDN data delivery
• Read-after-write consistency
• SnapLinks – a mash-up of a file hard link (ln) and a snapshot,
allowing alternate file naming and versioning in place.
Use them to mimic data movement.
• REST with JSON API
• Interactive shell access through a Node.js-driven SDK
and commands
• Compute in place with map-reduce processing with
arbitrary code and scripts without data movement
• GuardTime keyless data signatures and validation
12. Manta’s Compute-on-Storage
For a simple grep-style text query
in a big-data collection of server logs:
• On AWS S3
– Move the “big data” into EC2 or Hadoop
– Then orchestrate a method to run the query
– Then clean up the additional big data instances
• On Manta
– grep in place on the storage servers
– Manta hands back your job output in a new folder
13. How does Manta work?
End User
• Install the Node.js package with
mlogin and the local Manta
commands
• The local Node.js environment
includes the Manta interactive
shell and fast I/O data and
command transfers up to the
Manta Data Center.
• Commands transit via REST
APIs with JSON encoding.
These can be called directly.
Data Center
• Connects to End User
• Distributes and commits data
uploads according to
replication policy (2 by default)
• Fast consistency: data is ready
to use without waiting for
sync
• Jobs are launched in SmartOS
Zone VM images on the server.
• The hashed UID of the Zone
that is launched becomes the
job number/directory for
output data
14. Manta Commands
Client-Side Utilities
Installed locally as Joyent Manta Node.js SDK.
Also available to your jobs on the data side.
• mls - Lists directory contents
• mput - Uploads data to an object
• mget - Downloads an object from the service
• mjob - Creates and runs a computational job on the
service
• mfind - Walks a hierarchy to find names of objects
by name, size, or type
• mlogin - Interactive session client
• mln - Makes SnapLinks between objects
• mmkdir - Make directories
• mrm - Remove objects or directories
• mrmdir - Remove empty directories
• msign - Create a signed URL to an object stored in
the service
• muntar - Create a directory hierarchy from a tar file
Data-Side Utilities
Additional commands are available to your jobs in the data-
side compute environment:
• maggr - Performs key-wise aggregation on plain text
files.
• mcat - Emits the named object as an output for the
current task.
• mpipe - Output pipe for the current task.
• msplit - Split the output stream for the current task to
many reducers.
• mtee - Capture stdin and write to both stdout and an
object.
Control interactively via shell-like SDK, OR automate with REST + JSON APIs.
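To make "key-wise aggregation" concrete, here is a minimal local sketch using only awk and sort. It does not call maggr or reproduce its options; the function name sum_by_key and the sample input are made up for illustration.

```shell
#!/bin/sh
# Key-wise sum over plain text lines of the form "key value".
# Hypothetical helper: illustrates the idea, not maggr's interface.
sum_by_key() {
    awk '{ total[$1] += $2 } END { for (k in total) print k, total[k] }' | sort
}

# Sample input: request type and a count per line.
printf 'GET 10\nPUT 4\nGET 5\nPUT 1\n' | sum_by_key
```

Each distinct key in the first column ends up on one output line with the sum of its second-column values.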
15. Manta patterns for job creation
• $ mjob create -m 'command-to-map' -r 'command-to-reduce'
• Big Data Map Reduce version of grep:
– (GNU grep -H prints the name of the file matching the pattern, so you know which file is matched)
• $ mjob create -m 'grep -H --label=$MANTA_INPUT_OBJECT pattern' -r cat
http://apidocs.joyent.com/manta/job-patterns.html
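The grep pattern above can be tried locally before submitting a job. The sketch below does not talk to Manta: it imitates the map phase by running grep once per file with the file on stdin, uses --label as a stand-in for the $MANTA_INPUT_OBJECT variable, and imitates the reduce phase with cat. The log file names and contents are made up, and GNU grep is assumed.

```shell
#!/bin/sh
# Local imitation of:
#   mjob create -m 'grep -H --label=$MANTA_INPUT_OBJECT pattern' -r cat
# Assumption: GNU grep (--label is a GNU extension).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Two made-up "objects" standing in for stored server logs.
printf 'GET /index.html 200\nGET /missing 404\n' > "$workdir/web1.log"
printf 'GET /about.html 200\nGET /gone 404\n'    > "$workdir/web2.log"

# Map phase: one grep per object, reading the object on stdin.
map_phase() {
    for obj in "$workdir"/*.log; do
        MANTA_INPUT_OBJECT=$(basename "$obj")   # set by hand here
        grep -H --label="$MANTA_INPUT_OBJECT" "$1" < "$obj" || true
    done
}

# Reduce phase: cat concatenates the outputs of all map tasks.
grep_job() {
    map_phase "$1" | cat
}

grep_job 404
```

Because each map task reads its object on stdin, grep would normally report the name as "(standard input)"; --label replaces that with the object name, which is why the pattern passes it.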
16. Manta Documentation – Total Word Count in text file
collection with map-reduce of wc + awk 1-liner
Interactive
REST + JSON API
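The commands shown on this slide are not reproduced in the transcript, so the following is only a local sketch of the wc + awk idea: wc plays the map phase (one run per file, reading stdin) and an awk one-liner plays the reduce phase by summing the three columns. The sample files and function names are made up for illustration.

```shell
#!/bin/sh
# Local imitation of a total word count: map = wc, reduce = awk.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Two made-up text "objects".
printf 'one two three\nfour five\n' > "$workdir/a.txt"
printf 'six seven\n'                > "$workdir/b.txt"

# Map phase: wc on stdin prints "lines words bytes" for each object.
wc_map() {
    for obj in "$workdir"/*.txt; do
        wc < "$obj"
    done
}

# Reduce phase: add up each column across all map outputs.
sum_reduce() {
    awk '{ l += $1; w += $2; c += $3 } END { print l, w, c }'
}

wc_map | sum_reduce   # prints: 3 7 34
```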
18. What software can I run on Manta?
Thousands of ready to use UNIX packages
on the VM image:
• Python
• Perl
• R
• Node.js
• Java
• ImageMagick
• ffmpeg
• OpenSSL
• SQLite
• MySQL client
• Postgres client
Or run custom software that is
not on the VM image:
• These are called Assets
• Can be interpretable code or SmartOS
compatible binaries
• Upload a SmartOS compatible package
(e.g. tarball as tgz or a script file) to
Manta
• Use a job script that unpacks the custom
asset inside the Manta VM, and executes
it.
• Use standard Unix approaches for error
logging, output, pipes and tees.
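The asset steps above can be rehearsed on a local machine. This sketch covers only the packaging, unpacking and execution steps; it does not upload anything or submit a job, and every name in it (asset.tgz, count_lines.sh, run_asset, job.err) is hypothetical.

```shell
#!/bin/sh
# Local rehearsal of the asset pattern: package a script, unpack it in a
# scratch directory (as a job script would), then execute it on stdin.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# 1. Build the asset: a tarball holding one script.
mkdir "$workdir/pkg"
cat > "$workdir/pkg/count_lines.sh" <<'EOF'
#!/bin/sh
# Count input lines; log to stderr so stdout carries only the result.
echo "count_lines: starting" >&2
wc -l | tr -d ' '
EOF
tar -czf "$workdir/asset.tgz" -C "$workdir/pkg" count_lines.sh

# 2. The "job script": unpack the asset, run it, append stderr to a log.
run_asset() {
    scratch=$(mktemp -d "$workdir/job.XXXXXX")
    tar -xzf "$workdir/asset.tgz" -C "$scratch"
    sh "$scratch/count_lines.sh" 2>> "$workdir/job.err"
}

printf 'a\nb\nc\n' | run_asset   # prints: 3
```

Keeping diagnostics on stderr and results on stdout is the standard Unix convention the last bullet refers to; it lets the output be piped onward untouched.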
19. Use Cases
• Democratization of BIG DATA
– No longer in the hands of a few
• Mass market self-logging devices
– Transportation/Automotive
– E-health monitoring systems
– Sensor networks
• Scientific paper PDF collections
– Federate collections
– Allow large scale text mining
• Genomic Sequence Analysis
– Store Raw Data
– Move compute pipeline to data
– Meta-pipelines in parallel for computing
over old data with new knowledge
• Running a checksum over your data
to assure its integrity
• Log processing: clickstream analysis,
MapReduce on logs
• Text processing including search
• Image processing: converting
formats, generating thumbnails,
resizing
• Video processing: transcoding,
extracting segments, resizing
• Data Analysis, Mining and Graphing
with NumPy, SciPy and R
20. Manta Pricing http://www.joyent.com/products/manta/pricing
Manta compute charges
are by the second:
$0.00004 per GB of DRAM per second
If you run 1000 parallel tasks in 32 GB
DRAM instances on 1000 objects and
they each take a second, then you've
used 32,000 GB-seconds (1000 × 32 GB × 1 s)
and the cost for this job would be $1.28.
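The arithmetic in this example is tasks × GB of DRAM × seconds × rate. A small sketch, using the rate from this slide and a hypothetical helper named compute_cost:

```shell
#!/bin/sh
# Compute charge = tasks * GB of DRAM per task * seconds per task * rate.
RATE_PER_GB_SEC=0.00004   # rate quoted on the pricing slide

# Hypothetical helper: compute_cost TASKS GB_PER_TASK SECONDS_PER_TASK
compute_cost() {
    awk -v n="$1" -v gb="$2" -v s="$3" -v r="$RATE_PER_GB_SEC" \
        'BEGIN { printf "%.2f\n", n * gb * s * r }'
}

compute_cost 1000 32 1   # the slide's example; prints: 1.28
```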
Storage charges are slightly less than
Amazon S3
Bandwidth IN is free
Bandwidth OUT has tiered charges.
Request Type                   Price per unit of requests
DELETE                         Free
POST, PUT, LIST (“GET DIR”)    $0.005 per 1,000 requests
GET, OPTIONS, HEAD             $0.004 per 10,000 requests
Storage Tier       Default (2 copies)   Price per GB (per individual copy)
First 1 TB/mo      $0.086               $0.043
Next 49 TB/mo      $0.072               $0.036
Next 450 TB/mo     $0.064               $0.032
Next 500 TB/mo     $0.058               $0.029
Next 4000 TB/mo    $0.054               $0.027
Next 5000 TB/mo    $0.050               $0.025
Default is 2 copies. When submitting an object to the service,
you can specify the number of copies stored, from one (1) to six
(6).
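A rough way to estimate a monthly storage bill from the tier table is sketched below. It rests on two assumptions that the slide does not state: that the tiers are marginal (each "Next" price applies only to the data in that band) and that 1 TB is counted as 1000 GB. The helper name storage_cost is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: storage_cost GB_STORED [COPIES]
# Assumptions: marginal tiers, 1 TB = 1000 GB, default of 2 copies.
# Data beyond the last listed tier (10,000 TB in total) is not priced.
storage_cost() {
    awk -v gb="$1" -v copies="${2:-2}" 'BEGIN {
        # Tier sizes in TB and per-copy prices per GB, from the table.
        n = split("1 49 450 500 4000 5000", size, " ")
        split("0.043 0.036 0.032 0.029 0.027 0.025", price, " ")
        left = gb; cost = 0
        for (i = 1; i <= n && left > 0; i++) {
            cap = size[i] * 1000                 # tier capacity in GB
            used = (left < cap) ? left : cap     # GB billed in this tier
            cost += used * price[i] * copies
            left -= used
        }
        printf "%.2f\n", cost
    }'
}

storage_cost 500      # 500 GB, 2 copies; prints: 43.00
storage_cost 2000 2   # 2 TB spans two tiers; prints: 158.00
```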
21. Deploy a Fast, Scalable, Free, Open Source
Private IaaS Cloud Today.
• SmartOS
http://smartos.org/
• Project FiFo
http://project-fifo.net
My PXE boot 2-node
desktop IaaS Cloud setup
FiFo Web Console managing SmartOS
KVM Type 1 (bare metal) Hypervisor