Manta Unleashed BigDataSG talk 2 July 2013

A walk-through of Joyent's Manta platform on SmartOS that explains how the illumos innovations of zones and ZFS, together with Node.js, led to the development of the Manta Object Store. Examples, the primary Manta commands and simple use cases are provided so you can start using Manta to analyze Big Data with arbitrary Unix/POSIX code, without moving the data.

Presentation Transcript

  • Manta Unleashed. BigDataSG Meetup, 2 July 2013. Christopher W. V. Hogue, Ph.D., chogue@blueprint.org
  • Big Data in 2002 – NBLAST: computing 361,249,575,000 protein sequence alignments and storing the significant hits. http://www.biomedcentral.com/content/pdf/1471-2105-3-13.pdf
  • Big Data in 2003 – Distributed computing: a tiered architecture for 10 billion protein 3D structure samples (diagram labels: Volunteer Computing, Blueprint Data Center)
  • What is Manta?
    • Manta is a new operating-system-level component of Joyent's IaaS platform, released June 26 2013. http://www.joyent.com
    • Manta is an object store for big data that you can compute on without moving your data
    • Manta provides map-reduce capability for executing arbitrary, POSIX-standard compute jobs directly on cloud storage servers
    • Manta allows map-reduce operations formed from any standard UNIX command or application, in any run-time language:
        – without moving stored data
        – without Hadoop or Java code
        – without loading raw data into a database
  • What Operating System?
    • Manta is built on SmartOS, which uses the illumos kernel, an open-source UNIX
    • SmartOS is not GNU/Linux
    • SmartOS is a very lightweight illumos distro for cloud hypervisors, with KVM and storage, that runs in RAM from PXE/CD/USB boot media
    • Derived from Sun Microsystems' OpenSolaris
    • Over 10,000 packages supported via the pkgsrc system
  • illumos is the open-source Unix kernel forked from Solaris (timeline diagram)
    • 1992: UNIX System V Release 4
    • 2004-2008: four years of legal work to open-source Solaris
    • Open-sourced components (CDDL): Kernel, DTrace, Crossbow, Zones, ZFS, SMF, MDB, and more…
    • Oracle closed its Solaris source… (date labels on the diagram: Jan 2010, Aug 2010)
    • illumos kernel innovations: bugfixes, GCC build, ZFS feature flags, ZFS background delete, ZFS LZ4 compression, KVM Type 1 hypervisor
    • Other diagram labels: Cloud OS, Server OS, Storage OS, Database OS
  • Manta – What is SmartOS?
    • SmartOS is Joyent's lightweight, illumos-kernel-based operating system, optimized for high-performance cloud computing.
    • illumos is an open-source fork of OpenSolaris, supported by Joyent, Nexenta, OmniTI, DEY systems, Delphix and other core committers.
    • After Oracle bought Sun Microsystems, many Solaris software engineers, including those who built ZFS, DTrace and other components, left Oracle and joined the illumos effort.
    • illumos distros that you can experiment with include SmartOS, OmniOS, OpenIndiana, and NexentaStor.
    • Prerequisite for using Manta: your code needs to run, and be tested, on (x86) illumos!
  • Joyent
    • Started in 2004
    • IaaS hosting:
        – Windows, Linux, FreeBSD KVM images
        – LinkedIn, Wanelo, Voxer, Storify, Geeklist, Tripshare … many others
        – Singapore's Reebonz (reebonz.com.sg)
    • 4 primary data centers: http://www.joyent.com/products/compute-service/data-centers
    • 3rd-party Smart Data Center licensees who run Joyent-powered clouds, e.g.:
        – Telefonica – Spain – http://cloud.telefonica.com/instantservers/
        – MiCloud – Taiwan – http://micloud.tw/ch/
        – Libero – Italy – http://cloud.libero.it/it/il_nostro_cloud/profilo/
    • Class-1 DC operators
    • SSAE 16 certified
    • Multi-layered physical security
    • Highly redundant power
    • Early-warning fire suppression
    • All Tier-1 ISP connectivity
    • 10Gb/40Gb fully meshed network
    • Full peering, fiber connectivity
    • May 20 2013 – Dell drops its OpenStack cloud and partners with Joyent for high-performance, high-availability IaaS service provision.
  • Joyent as an IaaS provider
    • Has full development control of the entire operating system stack
    • Is the corporate steward of the Node.js JavaScript run-time
    • Community friendly: provides SmartOS image downloads, source for free, and support
    • You can deploy a private cloud for free with the 3rd-party management software "Project FiFo"
  • SmartOS Storage Implementation
    • All SmartOS storage is local, on ZFS
        – Integrated disk/volume management
        – Copy-on-write
        – Self-healing
        – Protection against silent data corruption
        – No hardware RAID dependency
        – Striping, RAID-Z with no write hole
        – No fsck resilvering
        – Built-in filesystem compression options
        – Compress a subdirectory
        – Snapshots
        – zfs send / receive
        – Integrated SSD IO caching
        – Add drives with one command, while in production
    • Manta in the Joyent datacenter is built on ZFS
        – no SAN, no NAS head nodes
        – no tiered layers
        – standard commodity Intel servers
        – 4U servers with 73 TiB of user data
        – basic SAS HBA technology
        – Every object is stored on 2 ZFS pools by policy default, local to the server on which it is accessed
        – The architecture leads me to speculate that Manta stands for "Manta is Not Tiered Architecture"
  • Manta Features
    • A multi-datacenter object store
    • Fine-grained replication commands
    • No object size limits
    • Per-object replication policies
    • Filesystem-like namespace, including directory queries
    • Up to 1 million files per directory
    • Public folders for CDN data delivery
    • Read-after-write consistency
    • SnapLinks: a file hard-link (ln) and snapshot mash-up, allowing alternate file naming and versioning in place. Use them to mimic data movement.
    • REST with JSON API
    • Interactive shell access through the Node.js-driven SDK and commands
    • Compute in place: map-reduce processing with arbitrary code and scripts, without data movement
    • GuardTime keyless data signatures and validation
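To make the SnapLink bullet concrete, here is a toy, in-memory Python model. It is only a sketch under an assumption drawn from the slide's "hard-link and snapshot mash-up" wording: stored objects are treated as immutable, a SnapLink is a second name for the same object, and writing to the original name again creates a new object rather than changing what the link sees. `TinyStore` and the paths are hypothetical names for illustration and are not part of any Manta SDK.

```python
class TinyStore:
    """Toy model of names pointing at immutable objects (not the Manta API)."""

    def __init__(self):
        self._objects = {}  # object id -> immutable bytes
        self._names = {}    # path -> object id
        self._next_id = 0

    def put(self, path, data):
        # Every write creates a brand-new object and repoints the name at it.
        self._objects[self._next_id] = bytes(data)
        self._names[path] = self._next_id
        self._next_id += 1

    def snaplink(self, source, target):
        # A link is just another name for the object the source names right now.
        self._names[target] = self._names[source]

    def get(self, path):
        return self._objects[self._names[path]]


store = TinyStore()
store.put("/demo/stor/report.txt", b"version 1")
store.snaplink("/demo/stor/report.txt", "/demo/stor/report-v1.txt")
store.put("/demo/stor/report.txt", b"version 2")

print(store.get("/demo/stor/report.txt"))
print(store.get("/demo/stor/report-v1.txt"))
```

In this model no bytes are copied when the link is made, which is the sense in which a SnapLink can "mimic data movement" and give versioning in place.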
  • Manta's Compute-on-Storage. For a simple grep-style text query over a big-data collection of server logs:
    • On AWS S3
        – Move the "big data" into EC2 or Hadoop
        – Then orchestrate a method to run the query
        – Then clean up the additional big-data instances
    • On Manta
        – grep in place on the storage servers
        – Manta hands back your job output in a new folder
  • How does Manta work?
    • End User
        – Install the Node.js package, which provides mlogin and the local Manta commands
        – The local Node.js environment includes the Manta interactive shell and fast I/O for data and command transfers up to the Manta data center
        – Commands transit via REST APIs with JSON encoding. These can be called directly.
    • Data Center
        – Connects to the end user
        – Distributes and commits data uploads according to the replication policy (2 copies by default)
        – Fast consistency: data is ready to use without waiting for sync
        – Jobs are launched in SmartOS zone VM images on the server
        – The hashed UID of the zone that is launched becomes the job number/directory for output data
  • Manta Commands
    • Client-Side Utilities. Installed locally as the Joyent Manta Node.js SDK. Also available to your jobs on the data side.
        – mls - Lists directory contents
        – mput - Uploads data to an object
        – mget - Downloads an object from the service
        – mjob - Creates and runs a computational job on the service
        – mfind - Walks a hierarchy to find objects by name, size, or type
        – mlogin - Interactive session client
        – mln - Makes SnapLinks between objects
        – mmkdir - Makes directories
        – mrm - Removes objects or directories
        – mrmdir - Removes empty directories
        – msign - Creates a signed URL to an object stored in the service
        – muntar - Creates a directory hierarchy from a tar file
    • Data-Side Utilities. Additional commands available to your jobs in the data-side compute environment:
        – maggr - Performs key-wise aggregation on plain text files
        – mcat - Emits the named object as an output for the current task
        – mpipe - Output pipe for the current task
        – msplit - Splits the output stream for the current task to many reducers
        – mtee - Captures stdin and writes it to both stdout and an object
    • Control interactively via the shell-like SDK, OR automate with the REST + JSON APIs.
  • Manta patterns for job creation
    • General form:
        $ mjob create -m 'command-to-map' -r 'command-to-reduce'
    • Big-data map-reduce version of grep (GNU grep -H prints the name of the file matching the pattern, so you know which file matched):
        $ mjob create -m 'grep -H --label=$MANTA_INPUT_OBJECT pattern' -r cat
    • http://apidocs.joyent.com/manta/job-patterns.html
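The shape of that grep job can be tried out locally. The Python sketch below is not Manta code: it only imitates the two phases, with a map function standing in for `grep -H --label=$MANTA_INPUT_OBJECT pattern` run once per object, and a reduce function standing in for `cat`. The object names and log lines are made up for illustration, and Python's `re` syntax differs in places from grep's.

```python
import re


def map_grep(object_name, text, pattern):
    """Map phase: matching lines of one object, each prefixed with the object name."""
    regex = re.compile(pattern)
    return [f"{object_name}:{line}" for line in text.splitlines() if regex.search(line)]


def reduce_cat(map_outputs):
    """Reduce phase: concatenate the output of every mapper, like `cat`."""
    return [line for output in map_outputs for line in output]


# Hypothetical objects standing in for server logs held in the object store.
objects = {
    "/demo/stor/logs/a.log": "GET /index 200\nGET /missing 404\n",
    "/demo/stor/logs/b.log": "POST /login 200\nGET /gone 404\n",
}

matches = reduce_cat(map_grep(name, text, r" 404$") for name, text in objects.items())
for line in matches:
    print(line)
```

Each mapper only ever sees one object, which is what lets the real service run the map command on whichever storage server already holds that object.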
  • Manta Documentation – total word count over a collection of text files, using a map-reduce of wc plus an awk one-liner (screenshot labels: Interactive; REST + JSON API)
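The exact commands from this slide's screenshot are not in the transcript, so the following is only a local Python sketch of the idea it names: a map step that counts words per object (the role of wc) and a reduce step that adds up the per-object counts (the role of the awk one-liner). The file names and contents are invented for the example.

```python
def map_wc(text):
    """Map phase: count whitespace-separated words in one object."""
    return len(text.split())


def reduce_sum(counts):
    """Reduce phase: add up the per-object counts."""
    return sum(counts)


# Hypothetical text collection.
documents = {
    "a.txt": "to be or not to be",
    "b.txt": "that is the question",
}

total_words = reduce_sum(map_wc(text) for text in documents.values())
print(total_words)
```

The reducer receives one small number per object instead of the objects themselves, so only the counts travel between phases.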
  • Manta Documentation – Image conversion with ImageMagick "convert"
  • What software can I run on Manta?
    • Thousands of ready-to-use UNIX packages on the VM image:
        – Python
        – Perl
        – R
        – Node.js
        – Java
        – ImageMagick
        – ffmpeg
        – OpenSSL
        – SQLite
        – MySQL client
        – Postgres client
    • Or run custom software that is not on the VM image:
        – These are called Assets
        – They can be interpretable code or SmartOS-compatible binaries
        – Upload a SmartOS-compatible package (e.g. a tarball as tgz, or a script file) to Manta
        – Use a job script that unpacks the custom asset inside the Manta VM and executes it
        – Use standard Unix approaches for error logging, output, pipes and tees
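The asset workflow (package, unpack inside the VM, execute) can be rehearsed on a laptop. The Python sketch below is a local stand-in only: it packs a small script into an in-memory .tgz, unpacks it into a temporary directory, and runs it on some input. The names `build_asset`, `run_asset` and `tool.py` are hypothetical, and nothing here talks to Manta.

```python
import io
import subprocess
import sys
import tarfile
import tempfile
from pathlib import Path


def build_asset(script_source):
    """Pack one script into an in-memory .tgz, standing in for an uploaded asset."""
    data = script_source.encode()
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="tool.py")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def run_asset(asset_bytes, stdin_text):
    """Unpack the asset into a scratch directory, run it, and return its stdout."""
    with tempfile.TemporaryDirectory() as workdir:
        with tarfile.open(fileobj=io.BytesIO(asset_bytes), mode="r:gz") as tar:
            tar.extractall(workdir)
        result = subprocess.run(
            [sys.executable, str(Path(workdir) / "tool.py")],
            input=stdin_text,
            capture_output=True,
            text=True,
            check=True,  # raise if the tool exits with an error
        )
    return result.stdout


# A trivial custom tool: upper-case whatever arrives on stdin.
SCRIPT = "import sys\nprint(sys.stdin.read().upper(), end='')\n"

asset = build_asset(SCRIPT)
output = run_asset(asset, "manta\n")
print(output, end="")
```

Keeping the tool a plain stdin-to-stdout filter is what makes it easy to drop into a map or reduce phase alongside standard Unix pipes.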
  • Use Cases
    • Democratization of BIG DATA
        – No longer in the hands of a few
    • Mass-market self-logging devices
        – Transportation/automotive
        – E-health monitoring systems
        – Sensor networks
    • Scientific paper PDF collections
        – Federate collections
        – Allow large-scale text mining
    • Genomic sequence analysis
        – Store raw data
        – Move the compute pipeline to the data
        – Meta-pipelines in parallel for computing over old data with new knowledge
    • Running a checksum over your data to assure its integrity
    • Log processing: clickstream analysis, MapReduce on logs
    • Text processing, including search
    • Image processing: converting formats, generating thumbnails, resizing
    • Video processing: transcoding, extracting segments, resizing
    • Data analysis, mining and graphing with NumPy, SciPy and R
  • Manta Pricing http://www.joyent.com/products/manta/pricing
    • Manta compute is charged by the second: $0.00004 per GB of DRAM per second. If you run 1000 parallel tasks in 32 GB DRAM instances on 1000 objects and each takes one second, you have used 32,000 GB-seconds and the job costs $1.28.
    • Storage charges are slightly less than Amazon S3
    • Bandwidth IN is free
    • Bandwidth OUT has tiered charges
    • Request charges:

        Request Type                  Price per unit of requests
        DELETE                        Free
        POST, PUT, LIST ("GET DIR")   $0.005/1000 requests
        GET, OPTIONS, HEAD            $0.004/10000 requests

    • Storage charges:

        Storage Tier      Price per GB, default (2 copies)   Price per GB (per individual copy)
        First 1 TB/mo     $0.086                             $0.043
        Next 49 TB/mo     $0.072                             $0.036
        Next 450 TB/mo    $0.064                             $0.032
        Next 500 TB/mo    $0.058                             $0.029
        Next 4000 TB/mo   $0.054                             $0.027
        Next 5000 TB/mo   $0.050                             $0.025

    • Default is 2 copies. When submitting an object to the service, you can specify the number of copies stored, from one (1) to six (6).
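The pricing arithmetic is easy to check in a few lines of Python. This is a sketch using only the 2013 figures quoted on the slide. Two things are assumptions and not from the slide: that 1 TB is billed as 1000 GB, and that the storage tiers are marginal (each "Next N TB" rate applies only to the data falling in that band).

```python
COMPUTE_RATE = 0.00004  # USD per GB of DRAM per second (from the slide)

# (tier size in TB, USD per GB per month per individual copy), from the slide
STORAGE_TIERS = [
    (1, 0.043),
    (49, 0.036),
    (450, 0.032),
    (500, 0.029),
    (4000, 0.027),
    (5000, 0.025),
]
GB_PER_TB = 1000  # assumption: decimal terabytes


def compute_cost(tasks, dram_gb, seconds_per_task):
    """Cost of a job: tasks x DRAM size x run time x per-GB-second rate."""
    return tasks * dram_gb * seconds_per_task * COMPUTE_RATE


def storage_cost(stored_gb, copies=2):
    """Monthly storage cost, assuming marginal tiers charged per copy."""
    if not 1 <= copies <= 6:
        raise ValueError("copies must be between 1 and 6")
    remaining = stored_gb
    total = 0.0
    for tier_tb, price_per_copy in STORAGE_TIERS:
        in_tier = min(remaining, tier_tb * GB_PER_TB)
        total += in_tier * price_per_copy * copies
        remaining -= in_tier
        if remaining <= 0:
            break
    if remaining > 0:
        raise ValueError("amount exceeds the tiers listed on the slide")
    return total


# The slide's example: 1000 tasks, 32 GB DRAM each, one second each.
print(round(compute_cost(1000, 32, 1), 2))

# 2 TB stored with the default 2 copies, under the assumptions above.
print(round(storage_cost(2000), 2))
```

Under these assumptions the per-copy column is the one that matters for the calculation: the "default (2 copies)" column is simply twice the per-copy rate.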
  • Deploy a fast, scalable, free, open-source private IaaS cloud today
    • SmartOS http://smartos.org/
    • Project FiFo http://project-fifo.net
    • Screenshot captions: my PXE-boot 2-node desktop IaaS cloud setup; FiFo web console managing the SmartOS KVM Type 1 (bare metal) hypervisor