Sahara presentation latest - Codemotion Rome 2015
1. ROME 27-28 march 2015 - Speaker’s name
Dive into Sahara
Davide Del Vecchio
Francesco Vollero
Matteo Bernacchi
March 27, 2015
2. ROME 27-28 march 2015 - Speaker’s name
Who are we
Davide Del Vecchio
•Principal Domain Architect Middleware
•Previous experience with analytics and Big Data
•Background in Science
•Passionate about technology
Francesco Vollero
•OpenStack Technical Specialist in EMEA
•Developer background, in OpenStack since Grizzly
•Core contributor in packstack, openstack-puppet
•Snooping other OpenStack components like Sahara
•Functional-programming-oriented brain :)
Matteo Bernacchi
•Senior Infrastructure Consultant
•Experienced in cloud solutions deployment
•Supporter of FOSS technologies since 2003
3. ROME 27-28 march 2015 - Speaker’s name
Agenda
•An introduction to Big Data
•An overview of the OpenStack components
•A (Moderately) Brief Introduction to Sahara
•Sahara in action
4. ROME 27-28 march 2015 - Speaker’s name
Everything You Ever Wanted to Know
About Big Data But Only Had About 20
Minutes to Learn
5. ROME 27-28 march 2015 - Speaker’s name
Insert some very Big Data here …
What is it?
•Something you cannot drag'n'drop
•Something you cannot expect to process in a reasonable amount of time on your machines
•Something that needs purpose-built algorithms to work with
6. ROME 27-28 march 2015 - Speaker’s name
It is not just a matter of volume ...
There are many other key aspects
•Data must be processed in a small time frame
•Data sets differ from traditional relational/non-relational data and include machine and social data
•The wide availability of open source computational and mathematical tools goes beyond academia
•It's the second iteration of the feedback process of open source tools that are now available as a commodity
•Data visualization tools are an accelerator for the movement
7. ROME 27-28 march 2015 - Speaker’s name
How do I commoditize Big Data
8. ROME 27-28 march 2015 - Speaker’s name
A Bit of History: MapReduce
-2004: MapReduce Whitepaper (Google)
-Described the MapReduce algorithm
-Kind of a big deal
-Many were already doing this; it's a very basic prescription
-Specification for easy extensibility
-THIS was the big deal
-Google's vision for clean extension points and design drove the Big Data movement
9. ROME 27-28 march 2015 - Speaker’s name
A Bit of History: Hadoop
-2007: Apache Hadoop
-First and still most significant OSS Big Data engine
-Originally built by Yahoo!
-“Hadoop” now used to refer both to Hadoop itself and the large ecosystem of supporting technologies
-Dominant in the market now, but there are new contenders
-Named after a developer's son's stuffed elephant
10. ROME 27-28 march 2015 - Speaker’s name
MapReduce: What Does It Do
•MAP
•Iterate over records
•Emit (0, 1, or n) key-value pairs for each
•Word Count:
•Input: “Let's reduce map reduce”
•Output: (“Let's”: 1), (“reduce”: 1), (“map”: 1), (“reduce”: 1)
•REDUCE
•Gather all the KVPs for each key together
•Apply some function to all of each key's values and emit something for each key
•Word Count:
•Input: {“Let's”: [1], “map”: [1], “reduce”: [1, 1]}
•Output: {“Let's”: 1, “map”: 1, “reduce”: 2}
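The word count walk-through above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, gather and reduce steps, not Hadoop code; the function names map_words, shuffle and reduce_counts are made up for this example.

```python
from collections import defaultdict

def map_words(record):
    """MAP: emit one (word, 1) key-value pair per word in the record."""
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    """Gather all the values emitted for the same key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def reduce_counts(grouped):
    """REDUCE: apply a function (here, sum) to each key's values."""
    return {key: sum(values) for key, values in grouped.items()}

pairs = map_words("Let's reduce map reduce")
grouped = shuffle(pairs)
counts = reduce_counts(grouped)
print(counts)
```

In a real cluster the map calls run on many nodes and the gather step moves data over the network; here everything happens in one process.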
11. ROME 27-28 march 2015 - Speaker’s name
So... It's... GROUP BY.
•Yes, it is kinda GROUP BY.
•You are now authorized to laugh at Big Data engineers.
•It is, however, VERY easy to parallelize.
•M Mappers can be run against any amount of data on any number of nodes, in small chunks
•N Reducers only have to deal with the data for any one key at a time
12. ROME 27-28 march 2015 - Speaker’s name
MapReduce Extension Points (Per Hadoop MapReduce Interface)
•An Input Reader
•Divides data into “splits” (1 per mapper)
•Usually 16-128MB
•A Map Function
•A Combiner Function
•Just a reduce function within a mapper process
•With a combiner, mappers only emit one KVP per key
•A Partition Function
•Determines which key goes to which reducer
•Default is hash(key) % len(reducers)
•(Optional) A Compare Function
•Orders final output
•A Reduce Function
•An Output Writer
•By default, writes one file per reducer and just dumps text
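Two of the extension points listed above, the combiner and the partition function, can be illustrated with a small sketch. It assumes string keys and uses zlib.crc32 as a stand-in hash, because Python's built-in hash() is randomized per process for strings; Hadoop's own partitioner is not reproduced here, and the names combine and partition are illustrative.

```python
import zlib
from collections import Counter

def combine(pairs):
    """Combiner: pre-reduce inside the mapper so it emits one pair per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def partition(key, num_reducers):
    """Partitioner: hash(key) % number of reducers.

    zlib.crc32 is an assumed stand-in for the hash function so that the
    result is stable across runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers

mapper_output = [("reduce", 1), ("map", 1), ("reduce", 1)]
combined = combine(mapper_output)
for key, value in combined:
    print(key, value, "-> reducer", partition(key, 4))
```

Because the partition depends only on the key, every pair for a given key lands on the same reducer, no matter which mapper emitted it.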
13. ROME 27-28 march 2015 - Speaker’s name
MapReduce Abstraction Layers
•Hive (SQL-like)
•DROP TABLE IF EXISTS words;
•CREATE TABLE words( text string ) row format delimited fields terminated by '\n' stored as textfile;
•LOAD DATA LOCAL INPATH 'data_path' OVERWRITE INTO TABLE words;
•SELECT word, COUNT(*) FROM words LATERAL VIEW explode(split(text,' ')) lTable AS word GROUP BY word;
•Pig (relational flow)
•raw_input = LOAD './input.txt';
•words = FOREACH raw_input GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
•grouped = GROUP words BY word;
•counted = FOREACH grouped GENERATE group, COUNT(words);
•STORE counted INTO './wordcount';
14. ROME 27-28 march 2015 - Speaker’s name
Hadoop: HDFS
Hadoop Distributed File System
•Large block size
•128MB default
•Replication
•3 default, 512 max
•Strictly separate from logic: can be used with any algorithm
•Giraph: Graph Processing
•Mahout: Machine Learning
•The name node tracks data blocks and replication
•Data nodes hold data
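To make the block size and replication figures concrete, here is a small arithmetic sketch. It only uses the two defaults quoted on the slide (128MB blocks, replication 3); the function name hdfs_footprint and the 1000MB file are made up for the example.

```python
import math

BLOCK_SIZE_MB = 128   # default block size quoted on the slide
REPLICATION = 3       # default replication factor quoted on the slide

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_mb = file_size_mb * replication  # the last block is not padded
    return blocks, replicas, raw_mb

print(hdfs_footprint(1000))
```

For a 1000MB file this gives 8 blocks, 24 block replicas for the name node to track, and 3000MB of raw storage spread over the data nodes.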
15. ROME 27-28 march 2015 - Speaker’s name
Hadoop: Data Processing
•Master node tasks (the JobTracker in classic MapReduce; the NameNode itself only manages HDFS metadata)
•Breaks jobs (whole dataset) into tasks (one mapper or reducer)
•Assigns tasks to data nodes
•Tracks progress to completion
•Retries failed tasks a configurable number of times
•Allows Hadoop clusters to be run on error-prone commodity hardware
•Worker node tasks (the TaskTracker running on each data node)
•Tracks its own map and reduce tasks
•Transfers data to other nodes as needed
•Each data node has slots for map and reduce tasks (to be run in JVMs)
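The "retry failed tasks a configurable number of times" behaviour can be pictured with a toy retry loop. This is a sketch under stated assumptions, not Hadoop's scheduler: run_with_retries and make_flaky_task are hypothetical helpers, and max_attempts=4 is just an illustrative value.

```python
def run_with_retries(task, max_attempts=4):
    """Run task() until it succeeds or max_attempts is exhausted.

    Returns (result, attempts_used); re-raises the last error on failure.
    max_attempts=4 is an illustrative value, not read from any Hadoop config.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise

def make_flaky_task(failures_before_success):
    """Hypothetical task that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def task():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("simulated node failure")
        return "done"

    return task

print(run_with_retries(make_flaky_task(2)))
```

A task that fails twice still completes on the third attempt, which is why a job can survive individual failures on commodity hardware.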
16. ROME 27-28 march 2015 - Speaker’s name
Hadoop: The Ecosystem
•Oozie: Workflow manager (chained jobs)
•Data pipelining: Flume, Scribe, Kafka
•RDBMS integration: Sqoop
•Tabular interface for unstructured data: HCatalog
•M/R Abstraction: Pig, Hive
•SO MANY OTHERS
17. ROME 27-28 march 2015 - Speaker’s name
OpenStack: take a look at the best place
to host your Big Data platform
19. ROME 27-28 march 2015 - Speaker’s name
Why does the world need OpenStack?
● Cloud is widely seen as the next-generation IT delivery model
o Agile & Flexible
o Utility-based on-demand consumption
o Self-service driving down administrative overhead and
maintenance
● Public clouds are setting the benchmark of how IT could be delivered to
users
o Not all organisations are ready for public cloud
● Applications are being written differently today:
o More tolerant of failure
o Making use of scale-out architecture
20. ROME 27-28 march 2015 - Speaker’s name
Major issues with traditional infrastructure…
● Our data is too large
o Volumes of data are being generated at unprecedented levels
o Most of this data is unstructured
● Service requests are too large
o More and more devices are coming online
o Tablets, phones, laptops, BYOD generation…
● Crucially, applications weren’t written to cope with the demand!
o Traditional infrastructure capabilities are being exhausted
o Service uptime, QoS, KPIs and SLAs are slipping
21. ROME 27-28 march 2015 - Speaker’s name
Workloads are evolving…
Traditional workloads
● Typically each tier resides on a single machine
● Doesn’t tolerate any downtime
● Relies on underlying infrastructure for availability
● Applications scale-up, not out
Cloud-enabled Workloads
● Workload resides across multiple machines
● Applications built to tolerate failure
● Does not rely on underlying infrastructure
● Applications scale-out, not up
22. ROME 27-28 march 2015 - Speaker’s name
Or an easier analogy...
PETS = TRADITIONAL WORKLOADS
● Pets are given names like lasy.internal.redhat.com
● They are unique, lovingly hand raised and cared for
● When they get ill you nurse them back to health
FARM ANIMALS = CLOUD WORKLOADS
● Farm animals have tag numbers like piggie242.redhat.com
● They are almost identical to each other
● When they get ill you get another one
23. ROME 27-28 march 2015 - Speaker’s name
So, how does OpenStack fit in?
OpenStack is typically suitable for the following use cases:
● A public cloud-like Infrastructure-as-a-Service cloud platform
o Internal “Infrastructure on Demand” - private cloud
o Test and Development environments - e.g. sandbox
o Cloud service provider platform - reselling compute, network & storage
● Building a scale-out platform for cloud-enabled workloads
o Web-scale applications, e.g. Netflix-like, photo/video-streaming
o Academic or pharma workloads, e.g. genetic sequencing
24. ROME 27-28 march 2015 - Speaker’s name
OpenStack Architecture
•OpenStack is made up of individual autonomous components
•All of which are designed to scale-out to accommodate throughput and availability
•OpenStack is considered more of a framework that relies on drivers and plugins
•Largely written in Python and heavily dependent on Linux
25. ROME 27-28 march 2015 - Speaker’s name
OpenStack Identity Service (Keystone)
•Keystone provides a common authentication and authorisation store for OpenStack
•Responsible for users, their roles, and the project(s) to which they belong
•Provides a catalogue of all other OpenStack services
•All OpenStack services typically rely on Keystone to verify a user’s request
26. ROME 27-28 march 2015 - Speaker’s name
OpenStack Compute (Nova)
•Nova is responsible for the lifecycle of running instances within OpenStack
•Manages multiple different hypervisor types via drivers, e.g.:
•Red Hat Enterprise Linux (+KVM)
•VMware vSphere
27. ROME 27-28 march 2015 - Speaker’s name
OpenStack Image Service (Glance)
•Glance provides a mechanism for the storage and retrieval of disk images/templates
•Supports a wide variety of image formats, including qcow2, vmdk, ami, vhd and ova
•Many different backend storage options for images, including Swift…
28. ROME 27-28 march 2015 - Speaker’s name
OpenStack Object Store (Swift)
•Swift provides a mechanism for storing and retrieving arbitrary unstructured data
•Provides an object based interface via a RESTful/HTTP-based API
•Highly fault-tolerant with replication, self-healing, and load-balancing
•Architected to be implemented using commodity compute and storage
29. ROME 27-28 march 2015 - Speaker’s name
OpenStack Networking (Neutron)
•Neutron is responsible for providing networking to running instances within OpenStack
•Provides an API for defining, configuring, and using networks
•Relies on a plugin architecture for implementation of networks, examples include:
•Open vSwitch (default in Red Hat’s distribution)
•Cisco, PLUMgrid, VMware NSX, Arista, Mellanox, Brocade, etc.
30. ROME 27-28 march 2015 - Speaker’s name
OpenStack Volume Service (Cinder)
•Cinder provides block storage to instances running within OpenStack
•Used for providing persistent and/or additional storage
•Relies on a plugin/driver architecture for implementation, examples include:
•Red Hat Storage (GlusterFS), IBM XIV, HP LeftHand, 3PAR, etc.
31. ROME 27-28 march 2015 - Speaker’s name
OpenStack Orchestration (Heat)
•Heat facilitates the creation of ‘application stacks’ made from multiple resources
•Stacks are defined in a descriptive template language and imported into Heat
•Heat manages the automated orchestration of resources and their dependencies
•Allows for dynamic scaling of applications based on configurable metrics
32. ROME 27-28 march 2015 - Speaker’s name
OpenStack Telemetry (Ceilometer)
•Ceilometer is a central collection point for metering and monitoring data
•Primarily used for chargeback of resource usage
•Ceilometer consumes data from the other components - e.g. via agents
•Architecture is completely extensible - meter what you want to - expose via API
33. ROME 27-28 march 2015 - Speaker’s name
OpenStack Dashboard (Horizon)
•Horizon is OpenStack’s web-based self-service portal
•Sits on-top of all of the other OpenStack components via API interaction
•Provides a subset of underlying functionality
•Examples include: instance creation, network configuration, block storage attachment
•Exposes an administrative extension for basic tasks, e.g. user creation
34. ROME 27-28 march 2015 - Speaker’s name
Common OpenStack Architecture
•All OpenStack components expose a RESTful API for communication
•A stateless, shared-nothing API service provides scalability and fault-tolerance
•Keystone manages a list of these API endpoints in its catalog
35. ROME 27-28 march 2015 - Speaker’s name
Common OpenStack Architecture
[Diagram: the question “Where’s Nova?” is answered with http://server0:8773; a load balancer (LB) is shown with the endpoints server0:8773, server1:8773, server2:8773 and server3:8773]
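The "Where's Nova?" lookup can be pictured as a simple catalog query. The sketch below is an assumption-laden illustration: the CATALOG dictionary is a hypothetical in-memory stand-in and not the format Keystone returns, only the Nova URL comes from the slide, and find_endpoint is a made-up helper.

```python
# Hypothetical in-memory catalog; the layout is illustrative and is not
# the format Keystone actually returns. Only the "nova" URL is taken
# from the slide; the "glance" entry is invented for the example.
CATALOG = {
    "nova": "http://server0:8773",
    "glance": "http://server0:9292",
}

def find_endpoint(catalog, service_name):
    """Return the API endpoint registered for a service."""
    try:
        return catalog[service_name]
    except KeyError:
        raise LookupError(f"no endpoint registered for {service_name!r}") from None

print(find_endpoint(CATALOG, "nova"))
```

Because clients only learn the single published URL, the API servers behind it can be added or removed without the clients noticing.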
36. ROME 27-28 march 2015 - Speaker’s name
Common OpenStack Architecture
•In addition to providing API services, each component has a set of workers
•These workers actually do the heavy lifting behind the scenes
•Workers (and API services) scale-out and communicate using a message bus (RabbitMQ)
•Example with Nova:
[Diagram: Nova API and three Nova Compute workers connected through RabbitMQ (AMQP)]
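The API-plus-workers pattern can be imitated in a single process. In this sketch a queue.Queue stands in for the message bus and threads stand in for Nova Compute workers; no RabbitMQ or OpenStack code is involved, and run_workers and the request strings are made up for illustration.

```python
import queue
import threading

def run_workers(requests, num_workers=3):
    """Fan requests out to workers over a shared queue; return the results."""
    bus = queue.Queue()            # stand-in for the message bus
    results = []
    lock = threading.Lock()

    def worker(name):
        while True:
            request = bus.get()
            if request is None:    # sentinel: no more work for this worker
                return
            with lock:
                results.append((name, request))

    threads = [threading.Thread(target=worker, args=(f"compute-{i}",))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for request in requests:       # the "API" side publishes messages
        bus.put(request)
    for _ in threads:              # one sentinel per worker
        bus.put(None)
    for t in threads:
        t.join()
    return results

handled = run_workers(["boot vm-1", "boot vm-2", "boot vm-3", "boot vm-4"])
print(sorted(request for _, request in handled))
```

Which worker picks up which request is not deterministic, and that is the point: the publisher does not need to know how many workers exist.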
39. ROME 27-28 march 2015 - Speaker’s name
Common OpenStack Architecture
• OpenStack services store state information in a SQL-based database, default is MySQL
• Each service can use its own database infrastructure or share a common platform
• For resilience and throughput, replicated multi-master databases can be implemented
• Example with Keystone:
[Diagram: Keystone Server, load balancer (LB), multi-master replication using Galera]
40. ROME 27-28 march 2015 - Speaker’s name
Common OpenStack Architecture
• OpenStack services check a user’s request with Keystone for both authentication and authorisation
• Example with Nova:
[Diagram: a “Launch an Instance” request reaches the Nova API, which asks the Keystone Server: 1) Are they authenticated? 2) Are they allowed to launch an instance? The answer is Success/Fail]
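The two questions in the Nova example, authentication first and authorisation second, can be sketched as follows. Everything here is hypothetical: TOKENS, ROLES and REQUIRED_ROLE are in-memory stand-ins invented for the example, and check_request is not a Keystone API.

```python
# Hypothetical in-memory stand-ins for Keystone's token and role stores.
TOKENS = {"token-abc": "alice", "token-xyz": "bob"}
ROLES = {"alice": {"member"}, "bob": set()}
REQUIRED_ROLE = {"launch_instance": "member"}

def check_request(token, action):
    """Answer the two questions in order: authenticated? then authorised?"""
    user = TOKENS.get(token)
    if user is None:
        return "Fail: not authenticated"
    if REQUIRED_ROLE[action] not in ROLES.get(user, set()):
        return "Fail: not authorised"
    return "Success"

print(check_request("token-abc", "launch_instance"))
print(check_request("token-xyz", "launch_instance"))
print(check_request("bad-token", "launch_instance"))
```

Keeping both checks in one shared service means each OpenStack component does not need its own user database.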
43. ROME 27-28 march 2015 - Speaker’s name
OpenStack Sahara, or what we were supposed to talk about today
44. ROME 27-28 march 2015 - Speaker’s name
Hadoop without Sahara: the challenges
•Hadoop clusters are difficult to configure, and few people have the expert knowledge to do it well
•Commodity hardware is cheap but requires frequent (costly, expert) maintenance
•Demand for data processing varies over time, even with sophisticated scheduling
•Bare-metal Hadoop cluster nodes can fail, leading to a loss of service
•Many public Big Data services don't give you flexibility
45. ROME 27-28 march 2015 - Speaker’s name
Hadoop with Sahara: beat the challenges
•OpenStack Sahara lets you:
•Deploy Hadoop clusters (predictable and repeatable)
•Scale the deployed clusters
•Define and run jobs
•Use a programmatic API and a web console
•Furthermore:
•It supports many Hadoop distributions
•It is well integrated with other OpenStack services
•It enables the use of Hadoop even with little knowledge about it
46. ROME 27-28 march 2015 - Speaker’s name
Sahara: the project
History:
•Started at Portland Summit
•Incubated in Icehouse
•Integrated in Juno
Main components:
•Sahara REST API
•Python REST Client and Sahara Pages (Integrated with Horizon)
•Elastic Data Processing
•Provisioning Engine
•Vendor Plugins (Vanilla, Intel, Hortonworks, Cloudera, MapR)
OVF was created by the DMTF, essentially a consortium of organizations including VMware, HP, IBM, Dell, Microsoft and XenSource. VMware began using it in 2007, although the final specification was produced in 2008.
A really quick plug of Red Hat’s OpenStack distribution - an enterprise-class fully-supported release.
Built specifically for and tightly integrated with RHEL - the #1 enterprise Linux distribution
We follow the upstream 6 month release cadence but take two months
— Enterprise class support from the #1 corporate contributors, and this isn’t just OpenStack - it’s Linux too