Sahara presentation latest - Codemotion Rome 2015

Rome, 27-28 March 2015
Dive into Sahara
Davide Del Vecchio
Francesco Vollero
Matteo Bernacchi
March 27, 2015

Who are we
Davide Del Vecchio
•Principal Domain Architect Middleware
•Previous experience with analytics and Big Data
•Background in Science
•Passionate about technology
Francesco Vollero
● OpenStack Technical Specialist in EMEA
● Developer background - in OpenStack since Grizzly
● Core contributor in packstack, openstack-puppet
● Snooping other OpenStack components like Sahara
● Functional programming brain oriented :)
Matteo Bernacchi
•Senior Infrastructure Consultant
•Experienced in cloud solutions deployment
•Supporter of FOSS technologies since 2003

•An introduction to Big Data
•An overview of the OpenStack components
•A (Moderately) Brief Introduction to Sahara
•Sahara in action
Agenda

Everything You Ever Wanted to Know
About Big Data But Only Had About 20
Minutes to Learn

Insert some very Big Data here …
What is it
•Something you cannot drag and drop
•Something you cannot hope to process in a reasonable amount of time on your own machines
•Something that needs purpose-built algorithms to work with

It is not just a matter of volume ...
There are many other key aspects
•Data must be processed in a short time frame
•Data sets differ from traditional relational/non-relational data and include machine and social data
•The wide availability of open source computational and mathematical tools reaches beyond academia
•It's the second iteration of the feedback process of open source tools that are now available as a commodity
•Data visualization tools are an accelerator for the movement

How do I commoditize Big Data

-2004: MapReduce Whitepaper (Google)
- Described the MapReduce algorithm
- Kind of a big deal
-Many were already doing this; it's a very basic prescription
-Specification for easy extensibility
-THIS was the big deal
-Google's vision for clean extension points and design drove
the Big Data movement
A Bit of History: MapReduce

-2007: Apache Hadoop
-First and still most significant OSS Big Data engine
-Originally built by Yahoo!
-“Hadoop” now used to refer both to Hadoop itself and the
large ecosystem of supporting technologies
-Dominant in the market now, but there are new contenders
-Named after a developer's son's stuffed elephant
A Bit of History: Hadoop

MapReduce: What Does It Do
•MAP
•Iterate over records
•Emit (0, 1, or n) key-value
pairs for each
•Word Count:
•Input: “Let's reduce map
reduce”
•Output: (“Let's”: 1),
(“reduce”: 1), (“map”: 1),
(“reduce”: 1)
•REDUCE
•Gather all the KVPs for each
key together
•Apply some function to all of
each key's values and emit
something for each key
•Word Count:
•Input: {“Let's”: [1], “map”: [1],
“reduce”: [1, 1]}
•Output: {“Let's”: 1, “map”: 1, “reduce”: 2}
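
The word count walk-through above can be sketched in a few lines of plain Python. This is only an illustration of the map, gather and reduce steps, not Hadoop code; the function names `map_words`, `shuffle` and `reduce_counts` are made up for this example.

```python
from collections import defaultdict


def map_words(record):
    """MAP: emit one (word, 1) key-value pair per word in the record."""
    return [(word, 1) for word in record.split()]


def shuffle(pairs):
    """Gather all the values emitted for each key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(grouped):
    """REDUCE: apply sum() to each key's values and emit one total per key."""
    return {key: sum(values) for key, values in grouped.items()}


pairs = map_words("Let's reduce map reduce")
grouped = shuffle(pairs)
counts = reduce_counts(grouped)
print(counts)
```

In a real cluster the map calls run on many nodes at once and the gather step moves data across the network; here everything happens in one process.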

So... It's... GROUP BY.
•Yes, it is kinda GROUP BY.
•You are now authorized to
laugh at Big Data engineers.
•It is, however, VERY easy to
parallelize.
•M Mappers can be run against
any amount of data on any
number of nodes, in small
chunks
•N Reducers only have to deal
with the data for any one key at
a time

MapReduce Extension Points (Per Hadoop MapReduce Interface)
•An Input Reader
•Divides data into “splits” (1 per
mapper)
•Usually 16-128MB
•A Map Function
•A Combiner Function
•Just a reduce function within a mapper
process
•With a combiner, mappers only emit
one KVP per key
•A Partition Function
•Determines which key goes to which
reducer
•Default is hash(key) % len(reducers)
•(Optional) A Compare Function
•Orders final output
•A Reduce Function
•An Output Writer
•By default, writes one file per reducer and
just dumps text
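
Two of these extension points, the combiner and the partition function, are easy to sketch in Python. This is a hedged illustration rather than Hadoop's implementation: `zlib.crc32` stands in for the key's hash (an assumption made so the result is stable between runs), and the names `combine` and `partition` are invented for the example.

```python
import zlib


def combine(pairs):
    """Combiner: pre-reduce inside the mapper so it emits one pair per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


def partition(key, num_reducers):
    """Partitioner: hash(key) % number of reducers.

    crc32 is used here as a stand-in hash function (assumption).
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


mapper_output = [("reduce", 1), ("map", 1), ("reduce", 1)]
combined = combine(mapper_output)
targets = {key: partition(key, 4) for key, _ in combined}
print(combined, targets)
```

Because the partition depends only on the key, every pair for a given key lands on the same reducer, whichever mapper emitted it.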

MapReduce Abstraction Layers
•Hive (SQL-like)
•DROP TABLE IF EXISTS words;
•CREATE TABLE words (text string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n' STORED AS TEXTFILE;
•LOAD DATA LOCAL INPATH 'data_path' OVERWRITE INTO TABLE words;
•SELECT word, COUNT(*) FROM words LATERAL VIEW explode(split(text, ' ')) lTable AS word GROUP BY word;
•Pig (relational flow)
•raw_input = LOAD './input.txt';
•words = FOREACH raw_input GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
•grouped = GROUP words BY word;
•counted = FOREACH grouped GENERATE group, COUNT(words);
•STORE counted INTO './wordcount';

Hadoop: HDFS
Hadoop Distributed File System
•Large block size
•128MB default
•Replication
•3 default, 512 max
•Strictly separate from logic – can be used with any algorithm
•Giraph: Graph Processing
•Mahout: Machine Learning
•The name node tracks data
blocks and replication
•Data nodes hold data
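
As a rough worked example of what the block size and replication defaults mean for storage, the sketch below estimates how many blocks a file occupies and how much raw disk it consumes. The helper name `hdfs_footprint` is hypothetical, and the calculation assumes the last block only takes up the space it actually needs.

```python
import math

BLOCK_SIZE_MB = 128  # default block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION):
    """Return (blocks, block replicas, raw MB on disk) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


# A 1 GB (1024 MB) file with the default settings.
blocks, replicas, raw_mb = hdfs_footprint(1024)
print(blocks, replicas, raw_mb)
```

With the defaults, a 1024 MB file is split into 8 blocks, stored as 24 block replicas, and takes about 3072 MB of raw capacity across the data nodes.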

Hadoop: Data Processing
•Master node tasks (the JobTracker in Hadoop 1.x, often hosted alongside the name node)
•Breaks jobs (whole dataset) into tasks (one mapper or reducer)
•Assigns tasks to data nodes
•Tracks progress to completion
•Retries failed tasks a configurable number of times
•Allows Hadoop clusters to be run on error-prone commodity hardware
•Data node tasks (the TaskTracker, hosted alongside each data node)
•Tracks its own map and reduce tasks
•Transfers data to other nodes as needed
•Each data node has slots for map and reduce tasks (to be run in JVMs)
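
The "retry failed tasks a configurable number of times" behaviour is what makes unreliable hardware tolerable. A minimal sketch of the idea follows, assuming a limit of 4 attempts chosen for illustration; `run_with_retries` and `make_flaky_task` are hypothetical helpers, not Hadoop APIs.

```python
def run_with_retries(task, max_attempts=4):
    """Run task() until it succeeds or max_attempts is exhausted.

    Returns (result, attempts_used); re-raises the last error on give-up.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception:
            if attempt == max_attempts:
                raise


def make_flaky_task(failures):
    """Build a task that fails `failures` times, then returns 'done'."""
    state = {"left": failures}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("simulated node failure")
        return "done"

    return task


result, attempts = run_with_retries(make_flaky_task(2))
print(result, attempts)
```

A real scheduler would also reassign the retried task to a different node; this sketch only shows the bounded retry loop.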

Hadoop: The Ecosystem
•Oozie: Workflow manager
(chained jobs)
•Data pipelining: Flume, Scribe,
Kafka
•RDBMS integration: Sqoop
•Tabular interface for unstructured data: HCatalog
•M/R Abstraction: Pig, Hive
•SO MANY OTHERS

OpenStack: take a look at the best place
to host your Big Data platform

Why does the world need OpenStack?
● Cloud is widely seen as the next-generation IT delivery model
o Agile & Flexible
o Utility-based on-demand consumption
o Self-service driving down administrative overhead and
maintenance
● Public clouds are setting the benchmark of how IT could be delivered to
users
o Not all organisations are ready for public cloud
● Applications are being written differently today:
o More tolerant of failure
o Making use of scale-out architecture

● Our data is too large
o Volumes of data are being generated at unprecedented levels
o Most of this data is unstructured
● Service requests are too large
o More and more devices are coming online
o Tablets, phones, laptops, BYOD generation…
● Crucially, applications weren’t written to cope with the demand!
o Traditional infrastructure capabilities are being exhausted
o Service uptime, QoS, KPIs and SLAs are slipping
Major issues with traditional infrastructure…

Workloads are evolving…
Traditional workloads
● Typically each tier resides on a single machine
● Doesn’t tolerate any downtime
● Relies on underlying infrastructure for availability
● Applications scale-up, not out
Cloud-enabled workloads
● Workload resides across multiple machines
● Applications built to tolerate failure
● Does not rely on underlying infrastructure
● Applications scale-out, not up

Or an easier analogy...
PETS = TRADITIONAL WORKLOADS
● Pets are given names like lasy.internal.redhat.com
● They are unique, lovingly hand raised and cared for
● When they get ill you nurse them back to health
FARM ANIMALS = CLOUD WORKLOADS
● Farm animals have tag numbers like piggie242.redhat.com
● They are almost identical to each other
● When they get ill you get another one

OpenStack is typically suitable for the following use cases:
● A public cloud-like Infrastructure-as-a-Service cloud platform
o Internal “Infrastructure on Demand” - private cloud
o Test and Development environments - e.g. sandbox
o Cloud service provider platform - reselling compute, network &
storage
● Building a scale-out platform for cloud-enabled workloads
o Web-scale applications, e.g. Netflix-like, photo/video-streaming
o Academic or pharma workloads, e.g. genetic sequencing
So, how does OpenStack fit in?

•OpenStack is made up of individual autonomous components
•All of which are designed to scale-out to accommodate throughput and
availability
•OpenStack is considered more of a framework, that relies on drivers and
plugins
•Largely written in Python and is heavily dependent on Linux
OpenStack Architecture

•Keystone provides a common authentication and authorisation store for OpenStack
•Responsible for users, their roles, and the project(s) they belong to
•Provides a catalogue of all other OpenStack services
•All OpenStack services typically rely on Keystone to verify a user’s request
OpenStack Identity Service (Keystone)

•Nova is responsible for the lifecycle of running instances within OpenStack
•Manages multiple different hypervisor types via drivers, e.g.:
•Red Hat Enterprise Linux (+KVM)
•VMware vSphere
OpenStack Compute (Nova)

•Glance provides a mechanism for the storage and retrieval of disk
images/templates
•Supports a wide variety of image formats, including qcow2, vmdk, ami, vhd
and ova
•Many different backend storage options for images, including Swift…
OpenStack Image Service (Glance)

•Swift provides a mechanism for storing and retrieving arbitrary unstructured data
•Provides an object based interface via a RESTful/HTTP-based API
•Highly fault-tolerant with replication, self-healing, and load-balancing
•Architected to be implemented using commodity compute and storage
OpenStack Object Store (Swift)

•Neutron is responsible for providing networking to running instances within
OpenStack
•Provides an API for defining, configuring, and using networks
•Relies on a plugin architecture for implementation of networks, examples include:
•Open vSwitch (default in Red Hat’s distribution)
•Cisco, PLUMgrid, VMware NSX, Arista, Mellanox, Brocade, etc.
OpenStack Networking (Neutron)

•Cinder provides block storage to instances running within OpenStack
•Used for providing persistent and/or additional storage
•Relies on a plugin/driver architecture for implementation, examples include:
• Red Hat Storage (GlusterFS), IBM XIV, HP LeftHand, 3PAR, etc.
OpenStack Volume Service (Cinder)

•Heat facilitates the creation of ‘application stacks’ made from multiple resources
•Stacks are described in a declarative template language and imported into Heat
•Heat manages the automated orchestration of resources and their dependencies
•Allows for dynamic scaling of applications based on configurable metrics
OpenStack Orchestration (Heat)

•Ceilometer is a central collection point for metering and monitoring data
•Primarily used for chargeback of resource usage
•Ceilometer consumes data from the other components - e.g. via agents
•Architecture is completely extensible - meter what you want to - expose via API
OpenStack Telemetry (Ceilometer)

•Horizon is OpenStack’s web-based self-service portal
•Sits on-top of all of the other OpenStack components via API interaction
•Provides a subset of underlying functionality
•Examples include: instance creation, network configuration, block storage attachment
•Exposes an administrative extension for basic tasks, e.g. user creation
OpenStack Dashboard (Horizon)

•All OpenStack components expose a RESTful API for communication
•A stateless, shared-nothing API service provides scalability and fault-tolerance
•Keystone manages a list of these API endpoints in its catalog
Common OpenStack Architecture

[Diagram: “Where’s Nova?” → http://server0:8773; a load balancer (LB) sits in front of server0:8773, server1:8773, server2:8773 and server3:8773]
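
The two ideas on this slide, a catalog of API endpoints and stateless API servers behind a load balancer, can be modelled with a dictionary and a round-robin iterator. This is a sketch under assumptions: the hostnames and port are copied from the slide, but the split between a published endpoint and three backends, and the names `SERVICE_CATALOG`, `lookup_endpoint` and `round_robin`, are invented for illustration and are not Keystone's data model.

```python
import itertools

# Assumed catalog content: service name -> published endpoint.
SERVICE_CATALOG = {"nova": "http://server0:8773"}

# Assumed pool of identical, stateless API servers behind the load balancer.
BACKENDS = ["server1:8773", "server2:8773", "server3:8773"]


def lookup_endpoint(service):
    """Answer the 'Where's Nova?' question from the catalog."""
    return SERVICE_CATALOG[service]


def round_robin(backends):
    """Yield backends in turn, forever, as a simple load balancer would."""
    return itertools.cycle(backends)


endpoint = lookup_endpoint("nova")
picker = round_robin(BACKENDS)
first_four = [next(picker) for _ in range(4)]
print(endpoint, first_four)
```

Because the API servers share no state, any backend can serve any request, so adding capacity is a matter of adding entries to the pool.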

•In addition to providing API services, each component has a set of workers
•These workers actually do the heavy lifting behind the scenes
•Workers (and API services) scale-out and communicate using a message bus
(RabbitMQ)
•Example with Nova:
[Diagram: Nova API and three Nova Compute workers connected through RabbitMQ (AMQP)]
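
The API-plus-workers pattern can be imitated in a single process with Python's standard `queue` and `threading` modules. This is only an analogy: the in-memory `queue.Queue` stands in for RabbitMQ, and the worker names and request strings are made up for the example.

```python
import queue
import threading


def compute_worker(name, bus, results):
    """Take requests off the bus until a None sentinel arrives."""
    while True:
        message = bus.get()
        if message is None:
            break
        results.put((name, message))


bus = queue.Queue()      # stand-in for the message bus
results = queue.Queue()  # collects (worker, request) pairs

workers = [
    threading.Thread(target=compute_worker,
                     args=(f"nova-compute-{i}", bus, results))
    for i in range(3)
]
for worker in workers:
    worker.start()

# The "API" side only publishes requests; it never calls a worker directly.
requests = ["boot vm-a", "boot vm-b", "boot vm-c", "boot vm-d"]
for request in requests:
    bus.put(request)
for _ in workers:
    bus.put(None)  # one stop signal per worker
for worker in workers:
    worker.join()

handled = sorted(results.get()[1] for _ in range(len(requests)))
print(handled)
```

Which worker picks up which request is not deterministic, and that is the point: the publisher does not need to know how many workers exist.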

• OpenStack services store state information in a SQL-based database, default is MySQL
• Each service can use its own database infrastructure or share a common platform
• For resilience and throughput, replicated multi-master databases can be implemented
• Example with Keystone:
[Diagram: Keystone servers behind a load balancer (LB), backed by multi-master replication using Galera]

• OpenStack services check a user’s request with Keystone for both authentication and authorisation
• Example with Nova:
[Diagram: a “Launch an Instance” request reaches the Nova API, which asks the Keystone server: 1) Are they authenticated? 2) Are they allowed to launch an instance? The answer is Success/Fail]
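
The two questions in this flow, authentication first and authorisation second, can be sketched as a small function. Everything below is a toy model with made-up tokens, users, roles and policy; it does not reflect Keystone's actual API or token format.

```python
# Hypothetical data: token -> user, user -> roles, action -> required role.
TOKENS = {"token-alice": "alice", "token-bob": "bob"}
ROLES = {"alice": {"member"}, "bob": set()}
POLICY = {"launch_instance": "member"}


def check_request(token, action):
    """Return 'Success' or a 'Fail: ...' reason for the requested action."""
    user = TOKENS.get(token)
    if user is None:
        # 1) Are they authenticated?
        return "Fail: not authenticated"
    if POLICY[action] not in ROLES.get(user, set()):
        # 2) Are they allowed to perform this action?
        return "Fail: not authorised"
    return "Success"


print(check_request("token-alice", "launch_instance"))
print(check_request("token-bob", "launch_instance"))
print(check_request("bogus", "launch_instance"))
```

Keeping the two checks separate means a valid user with the wrong role gets a different answer from an unknown caller.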

OpenStack Architecture

OpenStack Sahara, or what we were supposed
to talk about today

Hadoop without Sahara: the challenges
•Hadoop clusters are difficult to configure and few have the expert knowledge to do it well
•Commodity hardware is cheap but requires frequent (costly, expert) maintenance
•Demand for data processing varies over time, even with sophisticated scheduling
•Bare-metal Hadoop cluster nodes can fail, leading to a loss of service
•Many public Big Data services don't give you flexibility

Hadoop with Sahara: beat the challenges
•OpenStack Sahara lets you:
•Deploy Hadoop clusters (predictable and repeatable)
•Scale the deployed clusters
•Define and run jobs
•Use a programmatic API interface and a web console
•Furthermore:
•It supports many Hadoop distributions
•It is well integrated with other OpenStack services
•It lets you use Hadoop even with little knowledge about it

Sahara: the project
History:
•Started at Portland Summit
•Incubated in Icehouse
•Integrated in Juno
Main components:
•Sahara REST API
•Python REST Client and Sahara Pages (Integrated with Horizon)
•Elastic Data Processing
•Provisioning Engine
•Vendor Plugins (Vanilla, Intel, Hortonworks, Cloudera, MapR)

Sahara: Architecture

Sahara: Use cases
•Cluster Management (API V1.0)
•On-demand, scalable, persistent clusters
•Supports multiple plugins
•Integrates with Heat, Glance, Nova, Neutron, and Cinder
•EDP (Elastic Data Processing) (API V1.1)
•Supports multiple job types (Java, MR, Hive, Pig, Spark...)
•Supports transient clusters (spin up, process, shut down) or persistent clusters
•Integrates with Swift (optionally) and services on VMs
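
The transient cluster lifecycle (spin up, process, shut down) maps naturally onto a Python context manager. The sketch below is purely illustrative: `transient_cluster` and `run_job` are hypothetical helpers that only record what would happen, and they are not part of the Sahara API or its Python client.

```python
from contextlib import contextmanager


@contextmanager
def transient_cluster(name, workers, log):
    """Model a cluster that exists only for the duration of the with-block."""
    log.append(f"spin up {name} with {workers} workers")
    try:
        yield name
    finally:
        # Shut down even if the job fails, so no resources are left running.
        log.append(f"shut down {name}")


def run_job(cluster, job_type, log):
    """Record a job submission against the given cluster."""
    log.append(f"run {job_type} job on {cluster}")


log = []
with transient_cluster("wordcount-cluster", workers=3, log=log) as cluster:
    run_job(cluster, "Pig", log)
print(log)
```

A persistent cluster would simply skip the shut-down step and stay available for later jobs.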

Sahara: end-user workflow

Questions?
