This crash course is designed to give an overview of cloud computing architecture and the open source software that can be used to deploy and manage a cloud computing environment.
Topics to be discussed in this session will include virtualization (KVM, LXC, and Xen Project), orchestration (Apache CloudStack, Eucalyptus, OpenNebula, and OpenStack), and storage (GlusterFS, Ceph, and others). The talk will also provide insight into how to deliver Platform-as-a-Service (PaaS) and what technologies can be used to complement this evolving cloud computing paradigm.
Systems administrators and IT generalists will leave the discussion with a general overview of the options at their disposal to effectively build and manage their own cloud computing environments using free and open source software and understand the capabilities and benefits of a host of technologies.
OSCON 2014 - Crash Course in Open Source Cloud Computing
1. Mark Hinkle
Senior Director, Open Source Solutions
Citrix Inc.
mark.hinkle@citrix.com
mrhinkle@gmail.com
@mrhinkle
Last updated: 7/20/2014
Crash Course in Open Source Cloud Computing
2. By Mark R. Hinkle
ABOUT ME
I Help Build Open Source Ecosystems
Open Source Experience
• Manage Citrix Open Source Business Office
• Apache CloudStack Committer and PMC Member
• Advisory boards: Gluster and Xen Project
• Joined Citrix via Cloud.com acquisition July 2011
• Zenoss Core open source project to 100,000 users, 1.5 million downloads
• Former LinuxWorld Magazine Editor-in-Chief
• Open Management Consortium organizer
• Author - “Windows to Linux Business Desktop Migration” – Thomson
• NetDirector Project - Open Source Configuration Management
3. By Mark R. Hinkle
Slides Available on Slideshare:
http://www.slideshare.net/socializedsoftware
Creative Commons Attribution-ShareAlike 4.0 International
Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material
for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made.
You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the
same license as the original.
4. By Mark R. Hinkle
AGENDA
• Vetting Open Source Cloud Projects
• What is Cloud in 60 Seconds
• Virtualization
• Infrastructure-as-a-Service
• SDN
• Open Source for the Amazon Web Services
5. By Mark R. Hinkle
VETTING OPEN SOURCE PROJECTS
How can you tell if they’re legit
• Code Velocity
• Committers
• Committer Reputation
• User-driven or Vendor-Driven Innovation
• User Activity
• Corporate Support*
• Reputation of Foundation*
6. By Mark R. Hinkle
OPEN SOURCE ANALYSIS
Visualizing Community Activity
http://www.ohloh.net http://activity.openstack.org
7. By Mark R. Hinkle
60 SECOND CLOUD DEFINITION
5 CHARACTERISTICS OF CLOUD
1. On-Demand Self-Service
2. Broad Network Access
3. Resource Pooling
4. Rapid Elasticity
5. Measured Service
User Cloud a.k.a. SOFTWARE-AS-A-SERVICE
Developer Cloud a.k.a. PLATFORM-AS-A-SERVICE
Systems Cloud a.k.a. INFRASTRUCTURE-AS-A-SERVICE
Just because Software Marketing Guys Think it’s the Internet
8. By Mark R. Hinkle
Elasticity and the cloud
Vertical Scaling (Scale-Up)
Allocate additional resources to VMs; requires a reboot; no need for distributed app logic; single point of OS failure
Horizontal Scaling (Scale-Out)
Application needs logic to work in distributed fashion (e.g. HAProxy and Apache Hadoop)
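As a rough illustration of scale-out elasticity, the sizing decision can be sketched as a small function. This is not from the original slides: the name `instances_needed`, the per-instance capacity figure, and the pool bounds are all hypothetical values chosen for the example.

```python
import math


def instances_needed(requests_per_sec, capacity_per_instance,
                     min_instances=1, max_instances=20):
    """Return how many identical instances a scale-out tier needs.

    All parameters are illustrative assumptions: capacity_per_instance is
    the load one VM can serve, and the min/max bounds cap the pool size.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    wanted = math.ceil(requests_per_sec / capacity_per_instance)
    # Clamp to the allowed pool size so the tier never scales to zero
    # or grows without bound.
    return max(min_instances, min(max_instances, wanted))


if __name__ == "__main__":
    for load in (50, 450, 5000):
        print(load, "req/s ->", instances_needed(load, 100), "instances")
```

A scale-up approach would instead resize one VM, which is why it needs no distributed application logic but leaves a single point of failure.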
9. By Mark R. Hinkle
HYPERVISORS AND CONTAINERS
Differences in virtualization
Type 1 Hypervisors
VMware, Xen Project, Hyper-V
Type 2 Hypervisors
KVM, VirtualBox
Containers
LXC
10. By Mark R. Hinkle
VIRTUALIZATION
Carving up compute resources
OPEN SOURCE
• Xen Project
• Citrix XenServer
• KVM
• VirtualBox
• OpenVZ
• LXC
PROPRIETARY
• VMware
• Microsoft Hyper-V
• Oracle VM (Based on Xen Project)
11. By Mark R. Hinkle
OPEN VIRTUALIZATION FORMATS
Virtualization Payloads
Open Virtualization Format (OVF) is an open standard for packaging and distributing virtual appliances or, more generally, software to be run in virtual machines.
Formats for hypervisors/cloud technologies:
• Amazon – AMI
• KVM – QCOW2
• VMware – VMDK
• Xen Project – IMG
• Hyper-V – VHD (Virtual Hard Disk)
• LXC – local file system/mount point – Docker*
12. By Mark R. Hinkle
LINUX CONTAINERS (LXC)
“Lightweight” Linux Virtualization
• Lets you run a Linux system within another Linux system
• A container is a group of processes on a Linux box, put together to provide an isolated environment
• From the inside, it looks like a VM
• Externally it looks like normal processes
• “chroot on steroids”
13. By Mark R. Hinkle
THE PORTABILITY PROBLEM
Containers compared to Hardware Virtualization
• Different file formats for virtual machines
• VMware uses vmdk file format, Xen and Hyper-V use VHD, KVM uses Raw or QCOW2
• Guest images may be “processor architecture” bound
• VMware and Xen can manage SCSI devices, but KVM cannot
• KVM and Xen can use virtio drivers but not VMware
• VMware uses a proprietary agent inside the guest OS (VMware tools) which does not work with Xen or KVM
• Xen uses VirtIo and ParaVirtualized drivers, Xen uses
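To make the file-format point concrete, here is a minimal sketch that guesses an image format from its signature bytes. It is an illustration rather than a complete detector: the function name `detect_image_format` is hypothetical, only sparse VMDK extents are recognized (flat VMDKs use a text descriptor), and the VHD check assumes the usual 512-byte footer.

```python
QCOW_MAGIC = b"QFI\xfb"       # first four bytes of a QCOW/QCOW2 image
VMDK_SPARSE_MAGIC = b"KDMV"   # first four bytes of a sparse VMDK extent
VHD_COOKIE = b"conectix"      # cookie that starts the VHD footer


def detect_image_format(data: bytes) -> str:
    """Guess a disk image format from its raw bytes.

    Returns "qcow2", "qcow", "vmdk", "vhd", or "raw". "raw" is only a
    fallback: raw images carry no header, so they cannot be confirmed.
    """
    if data.startswith(QCOW_MAGIC) and len(data) >= 8:
        # The version is a big-endian 32-bit integer after the magic.
        version = int.from_bytes(data[4:8], "big")
        return "qcow2" if version >= 2 else "qcow"
    if data.startswith(VMDK_SPARSE_MAGIC):
        return "vmdk"
    # Fixed-size VHDs keep the footer only at the end of the file;
    # dynamic VHDs also keep a copy at the start.
    if data.startswith(VHD_COOKIE) or data[-512:].startswith(VHD_COOKIE):
        return "vhd"
    return "raw"
```

In practice one would read only the first and last few hundred bytes of the file rather than loading the whole image into memory.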
14. By Mark R. Hinkle
CONTINUOUS INTEGRATION
Rebuild Applications on any Cloud and/or Virtualized Infrastructure
• Code – Application is stored in a repository (Subversion, Git)
• Build – Code is built (Jenkins)
• Test – Unit tests are automated (Jenkins)
• Deploy – Code is deployed to servers in various ways
Pipeline: Code, Build, Test, Deploy
ThoughtWorks Go – Open Source Continuous Delivery System
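The Code, Build, Test, Deploy sequence can be sketched as a tiny pipeline runner. This is an illustrative sketch only, not how Jenkins or Go is implemented; `run_pipeline` and the lambda stand-ins for each stage are hypothetical.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns a log of stage outcomes, one entry per stage that ran.
    """
    log = []
    for name, action in stages:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            break  # later stages (e.g. deploy) must not run
    return log


if __name__ == "__main__":
    stages = [
        ("code", lambda: True),    # stand-in for a repository checkout
        ("build", lambda: True),   # stand-in for a build job
        ("test", lambda: False),   # a failing unit-test run
        ("deploy", lambda: True),  # never reached in this example
    ]
    print(run_pipeline(stages))
```

The key property is that a failed test stage blocks deployment, which is what makes automated rebuilds on any infrastructure safe to repeat.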
15. By Mark R. Hinkle
DOCKER CONTAINER PACKAGING
Open source LXC Packaging Engine
Docker is an open-source project to easily create lightweight, portable, self-sufficient containers from any application. The same container that a developer builds and tests on a laptop can run at scale, in production, on VMs, bare metal, public clouds and more.
To learn more please visit: www.docker.io
16. By Mark R. Hinkle
WHAT IS DOCKER
System for Managing and Deploying LXC Containers
• Complement to LXC, not a replacement
• Manages daemonized processes on Linux using LXC
• Creates the ability to re-use and manage similar applications
• Content agnostic
• Hardware agnostic
• Easy to automate
• Integrated with other tools: Chef, OpenShift, Puppet, VMware, etc.
17. By Mark R. Hinkle
KUBERNETES
Container Cluster Management – Scheduler
Kubernetes builds on top of Docker to construct a clustered container scheduling service. Kubernetes enables users to ask a cluster to run a set of containers. The system will automatically pick worker nodes to run those containers on, which we think of more as "scheduling" than "orchestration".
To learn more please visit: https://github.com/GoogleCloudPlatform/kubernetes
Greek for Shipmaster
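The idea of a system picking worker nodes for a set of containers can be sketched with a toy first-fit scheduler. First-fit is an assumption made for brevity, not Kubernetes's actual placement algorithm, and the `Node` and `schedule` names and the resource units are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    name: str
    cpu_free: float          # free CPU cores
    mem_free: int            # free memory in MiB
    containers: List[str] = field(default_factory=list)


def schedule(containers: Dict[str, Tuple[float, int]],
             nodes: List[Node]) -> Dict[str, Optional[str]]:
    """Place each container on the first node with enough free resources.

    containers maps a container name to (cpu_cores, mem_mib).
    Returns container name -> node name, or None if nothing fits.
    """
    placement: Dict[str, Optional[str]] = {}
    for cname, (cpu, mem) in containers.items():
        placement[cname] = None
        for node in nodes:
            if node.cpu_free >= cpu and node.mem_free >= mem:
                # Reserve the resources so later containers see less room.
                node.cpu_free -= cpu
                node.mem_free -= mem
                node.containers.append(cname)
                placement[cname] = node.name
                break
    return placement
```

The caller only states what should run; which node ends up hosting each container is the scheduler's decision.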
18. By Mark R. Hinkle
APACHE MESOS
One to many tools for managing large numbers of devices
Apache Mesos is a cluster manager that simplifies the complexity of running applications on a shared pool of servers. Largely supported by Twitter; also used by LinkedIn and Airbnb.
Features
• Fault-tolerant replicated master using ZooKeeper
• Scalability to 10,000s of nodes
• Isolation between tasks with Linux Containers
• Multi-resource scheduling (memory and CPU aware)
• Java, Python and C++ APIs for developing new parallel applications
• Web UI for viewing cluster state
To learn more please visit: http://mesos.apache.org/
19. By Mark R. Hinkle
APACHE ZOOKEEPER
Centralized Server to Service Distributed Apps
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
To learn more please visit: http://zookeeper.apache.org/
20. By Mark R. Hinkle
INFRASTRUCTURE-AS-A-SERVICE
Compute Orchestration
Project | Year Started | License | Virtualization Technologies
Apache CloudStack | 2008 | Apache | (Bare Metal), XenServer, KVM, LXC, VMware, Hyper-V
Eucalyptus | 2006 | GPL | Xen, KVM, VMware (commercial version)
OpenNebula | 2005 | Apache | Xen, KVM, VMware
OpenStack | 2010 (developed by NASA; by Anso Labs previously) | Apache | VMware ESX and ESXi, Xen, XenServer, KVM, LXC, QEMU and VirtualBox
21. By Mark R. Hinkle
OPENSTACK
The Boy Band of the Open Source Cloud
22. By Mark R. Hinkle
OPENSTACK SHARED SERVICES
Span Compute, Storage and Networking
• IDENTITY SERVICE
• IMAGE SERVICE
• TELEMETRY SERVICE
• ORCHESTRATION SERVICE
23. By Mark R. Hinkle
EVEN MORE OPENSTACK PROJECTS
Span Compute, Storage and Networking
• Cinder: Block Storage Service
• Ceilometer: Metering/Monitoring
• Heat: Orchestration
• Trove: Database Service
• Ironic: Bare Metal
• Marconi: Queue Service
24. By Mark R. Hinkle
OPENSTACK SOLUTION PROVIDERS
If you can’t do it yourself
“OpenStack is not a product. If you are building a large infrastructure, it’s more like a tool kit. It gives you a lot of technologies that do take a lot of effort to integrate.”
Chris Kemp, OpenStack Board Member and Co-Founder
CEO of Piston Computing
25. By Mark R. Hinkle
CLOUD APIS
Everything (should) have an API in the Cloud
• Deltacloud (Ruby)
• Dasein (Java)
• jclouds (Java)
• Libcloud (Python)
• Fog (Ruby)
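Libraries like those listed above put a provider-neutral API in front of many clouds. The sketch below shows the general pattern only; it does not reproduce any of those libraries' real interfaces. `ComputeDriver`, `InMemoryDriver`, and `provision_web_tier` are hypothetical names, and the in-memory driver stands in for a real provider so the example runs without credentials.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ComputeDriver(ABC):
    """Provider-neutral interface; each cloud would get its own subclass."""

    @abstractmethod
    def create_node(self, name: str, size: str) -> str: ...

    @abstractmethod
    def list_nodes(self) -> List[str]: ...

    @abstractmethod
    def destroy_node(self, node_id: str) -> bool: ...


class InMemoryDriver(ComputeDriver):
    """Fake provider that only records nodes in a dictionary."""

    def __init__(self) -> None:
        self._nodes: Dict[str, str] = {}
        self._next_id = 1

    def create_node(self, name: str, size: str) -> str:
        node_id = f"i-{self._next_id:04d}"
        self._next_id += 1
        self._nodes[node_id] = f"{name} ({size})"
        return node_id

    def list_nodes(self) -> List[str]:
        return sorted(self._nodes)

    def destroy_node(self, node_id: str) -> bool:
        return self._nodes.pop(node_id, None) is not None


def provision_web_tier(driver: ComputeDriver, count: int) -> List[str]:
    """Caller code depends only on the interface, not on a provider."""
    return [driver.create_node(f"web-{i}", "small") for i in range(count)]
```

Swapping clouds then means swapping the driver object, while `provision_web_tier` stays unchanged.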
26. By Mark R. Hinkle
CLOUD STORAGE
Virtualized, Distributed usually on Commodity Hardware
Project | Description
Ceph | Distributed file storage system developed by DreamHost -> InkTank -> Red Hat (block, object, file)
GlusterFS | Scale-out NAS system aggregating storage over Ethernet or InfiniBand (file)
OpenStack Storage | Long-term object storage system (object)
Riak CS | Riak CS is open source software designed to provide simple, available, distributed cloud storage at any scale. Riak CS is S3-API compatible and supports per-tenant reporting for billing and metering use cases. (object)
Sheepdog | Distributed storage for KVM hypervisors, distributed iSCSI
27. By Mark R. Hinkle
PLATFORM-AS-A-SERVICE
Abstracted Cloud-Scale Run-Time Environments
Project | Sponsors | Languages/Frameworks
CloudFoundry | VMware -> Pivotal -> CloudFoundry Foundation | Spring for Java, Ruby for Rails and Sinatra, node.js, Grails, Scala on Lift and more via partners (e.g. Python, PHP)
Cloudify | Gigaspaces | [Groovy for deployment recipes]
OpenShift Origin | Red Hat | Java, Ruby, PHP, Perl and Python
Apache Stratos | WSO2 -> Apache Stratos | PHP, Tomcat, MySQL “cartridges”
28. By Mark R. Hinkle
SOFTWARE DEFINED NETWORKING (SDN)
Virtualization meets the network
Decoupling of the control and data planes of the network to improve efficiency. Communication from an SDN controller via a protocol to network devices both physical and virtual.
• Automation: Abstractions allow for programmable networks.
• Dynamic Networks: Network can be changed quickly via a controller.
• Security: Network offerings can match virtualization offerings for finer-grained security in a highly volatile compute landscape.
• Heterogeneous Management: Single control point for various devices.
29. By Mark R. Hinkle
SDN OVERVIEW
Application Layer: Business Applications
(API)
Control Layer: SDN Control Software, Network Services
Control Data Plane Interface (e.g. OpenFlow)
Infrastructure Layer: Network Devices
30. By Mark R. Hinkle
BENEFITS OF SDN
Network Virtualization is the final frontier of Software Defined Datacenter
• Dynamically update networks
• Automate network functionality
• “Program” security into the network
• Centrally apply policies to network and services
• Optimize networks
31. By Mark R. Hinkle
OPENFLOW
Virtualization meets the network
OpenFlow enables networks to evolve, by giving a remote controller the power to modify the behavior of network devices, through a well-defined "forwarding instruction set". The growing OpenFlow ecosystem now includes routers, switches, virtual switches, and access points from a range of vendors.
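The relationship described above, where a remote controller changes how a device forwards traffic, can be sketched as a match/action flow table. This is a conceptual toy, not the OpenFlow wire protocol: the field names, action strings, and the `FlowRule` and `Switch` classes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FlowRule:
    priority: int
    match: Dict[str, str]   # header field -> required value
    action: str             # e.g. "output:2" or "drop"


class Switch:
    """Toy switch that forwards by consulting controller-installed rules."""

    def __init__(self) -> None:
        self.flow_table: List[FlowRule] = []

    def install(self, rule: FlowRule) -> None:
        # Called by the remote controller; keep highest priority first.
        self.flow_table.append(rule)
        self.flow_table.sort(key=lambda r: r.priority, reverse=True)

    def handle(self, packet: Dict[str, str]) -> str:
        for rule in self.flow_table:
            if all(packet.get(k) == v for k, v in rule.match.items()):
                return rule.action
        return "send-to-controller"  # table miss: ask the controller
```

The switch holds no policy of its own; installing a higher-priority rule from the controller is enough to change its behavior.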
32. By Mark R. Hinkle
OPEN SOURCE SDN
Software Defined Network Controllers and more
Project | Description
Floodlight | The Floodlight Open SDN Controller is an enterprise-class, Apache-licensed, Java-based OpenFlow Controller. It is supported by a community of developers including a number of engineers from Big Switch Networks. See http://www.projectfloodlight.org/floodlight/
Indigo | Indigo is an open source project aimed at enabling support for OpenFlow on physical and hypervisor switches. Big Switch has helped numerous companies OpenFlow-enable their equipment and provides firmware for a number of popular switches. Indigo is the basis of Switch Light by Big Switch Networks. See http://www.projectfloodlight.org/indigo/
Lincx | LINCX is a pure OpenFlow software switch written in Erlang. It runs within a separate domain under the Xen hypervisor using LING (erlangonxen.org).
Nox | NOX is the original OpenFlow controller, and facilitates development of fast C++ controllers on Linux.
Open Daylight | Linux Foundation Collaborative Project based on Cisco One Controller and plugins from numerous vendors in development, e.g. IBM DOVE.
Open vSwitch | Open vSwitch is an open source (ASL 2.0), multilayer virtual switch designed to enable massive network automation through programmatic extension, while still supporting standard management interfaces and protocols (e.g. NetFlow, sFlow, SPAN, RSPAN, CLI, LACP, 802.1ag).
33. By Mark R. Hinkle
OPEN VSWITCH
Open vSwitch is a production quality, multilayer virtual switch licensed under the open source Apache 2.0 license. It is designed to enable massive network automation through programmatic extension, while still supporting standard management interfaces and protocols (e.g. NetFlow, sFlow, SPAN, RSPAN, CLI, LACP, 802.1ag).
To learn more please visit our website: http://openvswitch.org/
34. By Mark R. Hinkle
CONFIGURATION MANAGEMENT TOOLS
Tools with features for configuring cloud infrastructure
Project | Year Started | Language | License | Client/Server
CFEngine | 1993 | C | Apache | Yes
Chef | 2009 | Ruby | Apache | Chef Solo: No; Chef Server: Yes
Puppet | 2004 | Ruby | GPL | Yes & standalone
Salt | 2011 | Python | Apache | Yes
35. By Mark R. Hinkle
CLOUD AUTOMATION TOOLS
One to many tools for managing large numbers of devices
Project | Description
Ansible | Ansible's SSH-key based access allows contributors to the Fedora Project to assist in automating infrastructure while having access limited appropriately. (Its author originally wrote Func.)
Capistrano | Utility and framework for executing commands in parallel on multiple remote machines, via SSH. It uses a simple DSL that allows you to define tasks, which may be applied to machines in certain roles.
RunDeck | Rundeck is an open-source process automation and command orchestration tool with a web console.
Func | Func provides a two-way authenticated system for generically executing tasks, with integrations for Puppet and Cobbler.
MCollective | The Marionette Collective, AKA MCollective, is a framework to build server orchestration or parallel job execution systems.
Salt | Execute arbitrary shell commands or choose from dozens of pre-built modules of common (or complex) commands.
Scalr | Provides scaling across multiple cloud computing platforms; integrates with Chef.
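The one-to-many pattern these tools share, running a command on every machine in a role in parallel, can be sketched as follows. This is not the API of any tool named above: the inventory, host names, and `run_on_role` are hypothetical, and `fake_run` replaces a real SSH call so the sketch runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical inventory: role name -> hosts in that role.
INVENTORY: Dict[str, List[str]] = {
    "web": ["web1.example.com", "web2.example.com"],
    "db": ["db1.example.com"],
}


def fake_run(host: str, command: str) -> str:
    """Stand-in for an SSH call; it only reports what would run."""
    return f"{host}: ran '{command}'"


def run_on_role(role: str, command: str,
                runner: Callable[[str, str], str] = fake_run) -> Dict[str, str]:
    """Run one command on every host in a role, using parallel threads."""
    hosts = INVENTORY[role]
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        results = pool.map(lambda h: runner(h, command), hosts)
        # map() yields results in the same order as hosts.
        return dict(zip(hosts, results))
```

Passing a different `runner` is where a real transport (SSH, a message bus, an agent) would plug in.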
36. By Mark R. Hinkle
NETFLIX AWS TOOLBAG
Tools developed by a super Amazon Web Services Power User
ASGARD, ASTYANAX, EDDA, EUREKA, PRIAM, SIMIAN ARMY
http://netflix.github.com
37. By Mark R. Hinkle
CONTACT ME
Happy to Chat about Open Source, Cloud or Pittsburgh Sports
Professional: mark.hinkle@citrix.com
Personal: mrhinkle@gmail.com
Phone: 919.228.8049
Professional: http://open.citrix.com
Personal: http://www.socializedsoftware.com
Twitter: @mrhinkle
38. By Mark R. Hinkle
APPENDIX A
Additional Links to related stuff
39. By Mark R. Hinkle
ADDITIONAL LINKS
• Devops Toolchains Group
• Software Defined Networking: The New Norm for Networks (Whitepaper)
• DevOps Wikipedia Page
• NoSQL-Database.org – Ultimate Guide to the Non-Relational Universe
• Open Cloud Initiative
• NIST Cloud Computing Platform
• Open Virtualization Format Specs
• Clouderati Twitter Account
• Planet DevOps
• Nicira Whitepaper – It’s Time to Virtualize the Network
• Why Open vSwitch FAQ
• Stanford Seminar - Software-Defined Networking at the Crossroads
40. By Mark R. Hinkle
ADDITIONAL LINKS (CONT’D)
• SDN, NFV, and open source: The Operator’s View
• Puppet Labs: Build a Toolbox for Continuous Delivery
41. By Mark R. Hinkle
APPENDIX B
Stuff I’d have liked to talk about but didn’t have time for
42. By Mark R. Hinkle
SOURCING CLOUD APPLIANCES
Packaging Engines for VMs
Tool/Project | What you can do with them
Bitnami | Bitnami provides free, ready-to-run environments for your favorite open source web applications and frameworks, including Drupal, Joomla!, WordPress, PHP, Rails, Django and many more.
BoxGrinder | BoxGrinder is a set of projects that help you grind out appliances for multiple virtualization and Cloud providers.
Oz | Command-line tool that has the ability to create images for common Linux distributions to run on KVM.
SUSE Studio | SUSE Studio supports building and deploying directly to cloud services such as Amazon EC2.
43. By Mark R. Hinkle
CLOUD MONITORING TOOLS
Tools with features for monitoring cloud infrastructure
Project | Type of Monitoring | Collection Methods
Cacti / RRDTool | Performance | SNMP, syslog
Graphite | Performance | Agent
Nagios | Availability | SNMP, TCP, ICMP, IPMI, syslog
Sensu | Availability | Agent
Zabbix | Availability/Performance and more | SNMP, TCP/ICMP, IPMI, Synthetic Transactions
Zenoss | Availability, Performance, Event Management | SNMP, ICMP, SSH, syslog, WMI
44. By Mark R. Hinkle
CLOUD PROVISIONING TOOLS
Packaging Engines for VMs
Project | Installation Targets
Apache Provisionr (incubating) | Can provision 10s to 1000s of machines on various clouds.
Cobbler | Distributed virtual infrastructure using koan (kickstart of a network to PXE boot VMs) for Red Hat, OpenSUSE, Fedora, Debian, Ubuntu VMs
Crowbar | (Bare metal provisioning)
JuJu | Public Clouds: Amazon Web Services, HP Cloud; Private OpenStack clouds; Bare Metal via MAAS.
Salt Cloud | Tool to provision “salted” VMs that can then be updated by a central server via ZeroMQ
45. By Mark R. Hinkle
BIG DATA
46. By Mark R. Hinkle
NOSQL DATABASES
Horizontally scalable unstructured data retrieval
Name | Type | Description
Apache Cassandra | Wide Column Store/Families | API: many, Query Method: MapReduce, Replication: , Written in: Java, Concurrency: eventually consistent, Misc: like "Big-Table on Amazon Dynamo alike", initiated by Facebook
CouchDB | Document Store | API: Memcached API+protocol (binary and ASCII), most languages, Protocol: Memcached REST interface for cluster conf + management, Written in: C/C++ + Erlang (clustering), Replication: Peer to Peer, fully consistent, Misc: Transparent topology changes during operation, provides memcached-compatible caching buckets
HBase | Wide Column Store/Families | API: Java / any writer, Protocol: any write call, Query Method: MapReduce Java / any exec, Replication: HDFS Replication, Written in: Java
Hypertable | Wide Column Store/Families | API: Thrift (Java, PHP, Perl, Python, Ruby, etc.), Protocol: Thrift, Query Method: HQL, native Thrift API, Replication: HDFS Replication, Concurrency: MVCC, Consistency Model: Fully consistent, Misc: High performance C++ implementation of Google's Bigtable.
MongoDB | Document Store | API: BSON, Protocol: C, Query Method: dynamic object-based language & MapReduce, Replication: Master Slave & Auto-Sharding, Written in: C++, Concurrency
Redis | Key Value / Tuple Store | API: Tons of languages, Written in: C, Concurrency: in memory and saves asynchronous disk after a defined time. Append only mode available. Different kinds of fsync policies. Replication: Master / Slave, Misc: also lists, sets, sorted sets, hashes, queues.
Riak | Key Value / Tuple Store | API: JSON, Protocol: REST, Query Method: MapReduce term matching, Scaling: Multiple Masters; Written in: Erlang, Concurrency: eventually consistent (stronger than MVCC via Vector Clocks)
47. By Mark R. Hinkle
MAP REDUCE
Algorithm for Parallelized Data Set Processing
[Diagram: Problem Data is mapped by a Master Node to Worker Nodes 1, 2 and 3; their output is reduced to Solution Data.]
48. By Mark R. Hinkle
APACHE HADOOP
Apache Project for Parallelized Data Set Processing
Overview
• Handles large amounts of data
• Stores data in native format
• Delivers linear scalability at low cost
• Resilient in case of infrastructure failures
• Transparent application scalability
49. By Mark R. Hinkle
APACHE HADOOP ECOSYSTEM
Hadoop (Hadoop Common)
• HDFS: Distributes & replicates data across machines
• MapReduce: Distributes & monitors tasks. Parallel programming; handles large data blocks
• Hive: Data warehouse that provides SQL interface. Ad hoc projection of data structure to unstructured
• HBase (Non-Relational DB): Column-oriented schema-less distributed DB modeled after Google’s BigTable. Random real time read/write.
• Pig (Scripting): Platform for manipulating and analyzing large data sets. Scripting language for analysts.
• Mahout (Machine Learning): Machine learning libraries for recommendations, clustering, classifications and item sets.
• Chukwa, ZooKeeper
Editor's Notes
Image portability across hypervisors
https://www.ibm.com/developerworks/community/blogs/9e696bfa-94af-4f5a-ab50-c955cca76fd0/entry/image_portability_across_hypervisors1?lang=en
Martin Fowler - Continuous Integration
http://www.martinfowler.com/articles/continuousIntegration.html
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
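The student example above can be sketched in a few lines. This is a single-process illustration of the programming model only; a real MapReduce framework would run the phases on many servers and handle transfers and failures. The function names and the sample records in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(students: Iterable[str]) -> List[Tuple[str, int]]:
    """Map(): emit (first name, 1) for every student record."""
    return [(full_name.split()[0], 1) for full_name in students]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values into one queue per key (first name)."""
    queues: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        queues[key].append(value)
    return dict(queues)


def reduce_phase(queues: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce(): summarize each queue, here by counting its entries."""
    return {name: sum(values) for name, values in queues.items()}


def name_frequencies(students: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases to get first-name frequencies."""
    return reduce_phase(shuffle(map_phase(students)))
```

Calling `name_frequencies(["Ann Lee", "Bob Ray", "Ann Kim"])` counts two students named Ann and one named Bob. Because each queue is reduced independently, the reduce step is what a framework can spread across worker nodes.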