Django+NoSQL
HOW Hue Integrates
with Hadoop
Abraham Elmahrek
Cloudera - March 5th, 2014

Monday, March 3, 14
What is Hue?
HUE 1

Desktop-like in a browser.
It did its job but was pretty
slow, leaked memory, and was
not very IE friendly, though
definitely advanced for its
time (2009-2010).

HISTORY
HUE 2

The first flat structure port,
with Twitter Bootstrap all
over the place.

HISTORY
HUE 2.5

New apps and an improved UX,
adding nice new features
like autocomplete and
drag & drop.

HISTORY
HUE 3 ALPHA

Proposed design, didn’t
make it.

HISTORY
HUE 3

Transition to the new UI,
major improvements and
new apps.

HISTORY
HUE 3.5+

[Figure: overview of the Hue apps; the rotated app labels did not survive extraction.]
APPS

Hue Plugins
YARN

JobTracker

Pig
Oozie

Cloudera
Impala

HiveServer2
HDFS

Hive
Metastore

HBase
Solr

Zookeeper
Sqoop2

LDAP
SAML
FAST PACE
LAST MONTH

91 issues created and 90
resolved.
Core team + Community

STACK
BACKEND
Python 2.6+ and Django 1.4.5

FRONTEND
jQuery
Bootstrap
Knockout.js
Love
HADOOP INTERFACES
REST & THRIFT

Many Hadoop interfaces
used
CUSTOM CLIENTS

Provide custom clients for
more explicit API definitions

WebHDFS
YARN API (RM, NM, MR...)
HiveServer2
Impala
HBase
Oozie
Sqoop2
ZooKeeper
...
PROTOCOLS
REST

Use python-requests and a
custom client to streamline
RESTful interface calls.
Thrift

Custom connection pooling
and socket multiplexing to
streamline Thrift calls.

# REST: build an HTTP client for WebHDFS.
# The function wrapper and its name are illustrative, added so that
# `client` is bound and `return` is valid.
def make_client(url, security_enabled):
    client = http_client.HttpClient(url,
                                    exc_class=WebHdfsException,
                                    logger=LOG)
    if security_enabled:
        client.set_kerberos_auth()
    return client

# Thrift: get a client for HiveServer2.
thrift_util.get_client(TCLIService.Client,
                       query_server['server_host'],
                       query_server['server_port'],
                       service_name=query_server['server_name'],
                       kerberos_principal=kerberos_principal_short_name,
                       use_sasl=use_sasl,
                       mechanism=mechanism,
                       username=user.username,
                       timeout_seconds=conf.SERVER_CONN_TIMEOUT.get(),
                       use_ssl=conf.SSL.ENABLED.get(),
                       ca_certs=conf.SSL.CACERTS.get(),
                       keyfile=conf.SSL.KEY.get(),
                       certfile=conf.SSL.CERT.get(),
                       validate=conf.SSL.VALIDATE.get())
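
The connection pooling idea mentioned for Thrift can be sketched roughly as follows. This is a minimal standard-library illustration and not Hue's `thrift_util` code; `ClientPool`, `make_client`, and `FakeClient` are hypothetical names, and a real pool would also need to handle broken sockets and timeouts.

```python
import queue
from contextlib import contextmanager


class ClientPool:
    """Hypothetical pool that reuses client objects instead of reconnecting."""

    def __init__(self, make_client, max_size=2):
        self._make_client = make_client       # factory that opens a new connection
        self._idle = queue.Queue(maxsize=max_size)
        self.created = 0                      # how many clients were ever built

    @contextmanager
    def client(self):
        try:
            c = self._idle.get_nowait()       # reuse an idle client if there is one
        except queue.Empty:
            c = self._make_client()
            self.created += 1
        try:
            yield c
        finally:
            try:
                self._idle.put_nowait(c)      # hand it back for the next caller
            except queue.Full:
                pass                          # pool already full; drop the extra client


class FakeClient:
    """Stand-in for a Thrift client; a real one would hold an open socket."""

    def ping(self):
        return "pong"


pool = ClientPool(FakeClient)
with pool.client() as c:
    first = c
    c.ping()
with pool.client() as c:
    second = c
```

Here the second `with` block gets back the same object as the first, so only one client is ever created.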
ACCESSIBILITY
Middleware

Make Hadoop interfaces
accessible in request objects

class ClusterMiddleware(object):
    def process_view(self, request, ...):
        request.fs = cluster.get_hdfs(request.fs_ref)
        if request.user.is_authenticated():
            if request.fs is not None:
                request.fs.setuser(request.user.username)

def download(request, path):
    if not request.fs.exists(path):
        raise Http404(_("File not found."))
    if not request.fs.isfile(path):
        raise PopupException(_("not a file."))
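
To try the middleware pattern without a Django project or a cluster, something like the following sketch may help. Everything prefixed `Fake` and the `FsMiddleware` class are hypothetical stand-ins; only the `process_view(request, view_func, view_args, view_kwargs)` hook signature comes from Django itself.

```python
class FakeFs:
    """Stand-in for an HDFS client; it only records the acting user."""

    def __init__(self):
        self.user = None

    def setuser(self, username):
        self.user = username


class FakeUser:
    def __init__(self, username, authenticated=True):
        self.username = username
        self._authenticated = authenticated

    def is_authenticated(self):
        return self._authenticated


class FakeRequest:
    def __init__(self, user):
        self.user = user
        self.fs = None


class FsMiddleware:
    """Attach a filesystem client to the request before the view runs."""

    def __init__(self, get_fs):
        self._get_fs = get_fs                 # callable returning a client or None

    def process_view(self, request, view_func, view_args, view_kwargs):
        request.fs = self._get_fs()
        if request.fs is not None and request.user.is_authenticated():
            # Act on the cluster as the logged-in user
            request.fs.setuser(request.user.username)
        return None                           # None lets Django continue to the view


request = FakeRequest(FakeUser("alice"))
FsMiddleware(FakeFs).process_view(request, None, (), {})

anonymous = FakeRequest(FakeUser("guest", authenticated=False))
FsMiddleware(FakeFs).process_view(anonymous, None, (), {})
```

Views can then rely on `request.fs` being present and already scoped to the current user.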

HDFS
Goal

Easily browse, create, read,
update, and delete files in
HDFS

HDFS - Communication
REST

The NameNode exposes a
REST API called
WebHDFS
Explicit Client

Provide an API that is explicit

Request Accessible

Provide a middleware for
populating a request
member

http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE
http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN
...

class WebHdfs(Hdfs):
    def create(self, path, ...):
        ...
    def read(self, path, ...):
        ...

def download(request, path):
    if not request.fs.exists(path):
        raise Http404(_("File not found."))
    if not request.fs.isfile(path):
        raise PopupException(_("not a file."))
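
The URL pattern above can be assembled with a small helper. This is only a sketch: `webhdfs_url` is a hypothetical function, the host name is made up, and port 50070 is assumed as the NameNode HTTP port. It builds the URL only and makes no network call.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for one operation on one absolute path."""
    # Put op first, then any extra parameters in a stable order
    query = urlencode([("op", op)] + sorted(params.items()))
    # quote() keeps "/" but escapes spaces and other unsafe characters
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, quote(path), query)


open_url = webhdfs_url("namenode.example.com", 50070,
                       "/user/alice/data.txt", "OPEN")
create_url = webhdfs_url("namenode.example.com", 50070,
                         "/tmp/new file", "CREATE", overwrite="true")
```

An explicit client such as the `WebHdfs` class above would hide this URL building behind methods like `create` and `read`.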
HDFS - Cool Things
MIME Type Detection

Detect the various kinds of
files being read: Avro, GZIP,
etc.
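
One common way to detect a file's type is to look at its first few bytes. The sketch below assumes that approach (the slides do not say how Hue does it) and checks two well-known signatures: the gzip magic number and the Avro object container header. `sniff` is a hypothetical helper.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"      # first two bytes of every gzip stream
AVRO_MAGIC = b"Obj\x01"       # header of an Avro object container file


def sniff(header):
    """Guess a file kind from the first bytes read from it."""
    if header.startswith(GZIP_MAGIC):
        return "gzip"
    if header.startswith(AVRO_MAGIC):
        return "avro"
    return "unknown"


gz_kind = sniff(gzip.compress(b"hello")[:4])
avro_kind = sniff(b"Obj\x01" + b"rest of file")
text_kind = sniff(b"plain text")
```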
Pagination

Nice pagination by block
size when viewing a file
(soon to be more like a PDF
reader with content
automatically being added)
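
Paging through a file by block size comes down to turning a page number into a byte offset and length, which WebHDFS `OPEN` accepts as its `offset` and `length` parameters. A minimal sketch, with `page_range` as a hypothetical helper and 1-based page numbers as an assumption:

```python
def page_range(page, block_size, file_size):
    """Return (offset, length) for a 1-based page, clamped to the file size."""
    if page < 1:
        raise ValueError("pages are numbered from 1")
    offset = (page - 1) * block_size
    if offset >= file_size:
        return (file_size, 0)                 # past the end: nothing to read
    return (offset, min(block_size, file_size - offset))


first_page = page_range(1, 4096, 10000)
last_page = page_range(3, 4096, 10000)
beyond = page_range(4, 4096, 10000)
```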

HBase
Goal

Make it easy to view and
search HBase

HBase - Technical Risk
2 Dimensions

Potentially unbounded
numbers of columns and rows

Sparseness

Column names will often
differ per row
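
The sparseness problem is easy to see with a toy example. The data and helper names below are made up for illustration: a dense grid over all columns is mostly empty, while listing only the cells each row actually has stays compact.

```python
# Two rows with no column in common, as is typical for sparse HBase tables
rows = {
    "row1": {"cf:name": "a", "cf:age": "30"},
    "row2": {"cf:email": "b@example.com"},
}


def all_columns(rows):
    """Union of every column name seen in any row."""
    return sorted({col for cells in rows.values() for col in cells})


def dense_grid(rows):
    """One list per row with None for missing cells (wasteful when sparse)."""
    cols = all_columns(rows)
    return [[cells.get(c) for c in cols] for _, cells in sorted(rows.items())]


def non_null_cells(cells):
    """Only the cells a row actually has, sorted by column name."""
    return sorted(cells.items())


columns = all_columns(rows)
grid = dense_grid(rows)
compact = non_null_cells(rows["row2"])
```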

HBase - Communication
Thrift

Communicate with HBase
using Thrift for better
filtering

Explicit Client

Provide an API that is explicit

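# Base class changed from Hdfs to object: the HBase client talks Thrift
# and has no reason to inherit from the HDFS client.
class HBaseApi(object):
    def createTable(self, cluster, tableName, ...):
        ...
    def getRows(self, cluster, tableName, columns, ...):
        ...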
HBase - Results
Improved View

Intelligent view that
collapses null cells

Better Search

Improved searchability of
HBase via flexible search
MIME Type Detection

Able to view documents in
HBase: PDF, images, etc.

Hive
Goal

Make it easy to run queries
in Hive

Hive - Communication
Thrift

Communicate with
HiveServer2 using Thrift

Explicit Client

Provide a higher level API
that is explicit and easy to
configure
DBMS

Extend the capabilities of
the DBMS layer in Hue

thrift_util.get_client(TCLIService.Client,
                       query_server['server_host'],
                       query_server['server_port'],
                       service_name=query_server['server_name'],
                       ...)

class HiveServerClient:
    HS2_MECHANISMS = {'KERBEROS': 'GSSAPI', 'NONE': 'PLAIN',
                      'NOSASL': 'NOSASL'}

    def __init__(self, query_server, user, ...):
        thrift_util.get_client(TCLIService.Client,
                               ...)

class HiveServer2Dbms(object):
    def get_databases(self):
        return self.client.get_databases()
    ...
    def select_star_from(self, database, table):
        hql = "SELECT * FROM `%s.%s` %s" % (database,
              table.name, self._get_browse_limit_clause(table))
        return self.execute_statement(hql)
    ...
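
The two ideas in these snippets, mapping an authentication setting to a SASL mechanism and building a browse query, can be tried in isolation. The sketch below reuses the `HS2_MECHANISMS` table from the slide; `pick_mechanism` and `select_star` are hypothetical helpers, and quoting the database and table separately is a choice made here that differs from the slide's snippet.

```python
HS2_MECHANISMS = {"KERBEROS": "GSSAPI", "NONE": "PLAIN", "NOSASL": "NOSASL"}


def pick_mechanism(auth_setting):
    """Map a HiveServer2 authentication setting to a SASL mechanism name."""
    return HS2_MECHANISMS[auth_setting.upper()]


def select_star(database, table, limit=None):
    """Build a browse query; names containing backticks are not handled."""
    hql = "SELECT * FROM `%s`.`%s`" % (database, table)
    if limit is not None:
        hql += " LIMIT %d" % limit            # cap the rows fetched for browsing
    return hql


mechanism = pick_mechanism("kerberos")
query = select_star("default", "sample_07", limit=100)
unbounded = select_star("default", "sample_07")
```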
Hive - Results
One Page App

Intelligent view that lets
users focus on their
queries
Secure

Achieved some level of
security through SASL,
Kerberos, and SSL
Navigation

Able to navigate databases
and tables easily

DEMO
TIME

Missed something?
GET STARTED

Take a closer look at REST and Thrift
communication in Hue
The inner workings of the Filebrowser
The fundamentals of the HBase browser
The concepts behind the Beeswax app

What else does Hue do with Django?
Extensible settings

Configuration of settings.py
provided through the hue.ini

Security

Configurable session
timeouts, SAML
authentication, etc.

Doc Model

Polymorphic documents via
a base document model

Authentication

LDAP, PAM, OAuth, etc.
provided through
authentication backends

Permissions

Per-app permissions
configurable in the
UserAdmin

Testing

Mocked and functional tests
via nose + django-nose
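
As a rough picture of the "Extensible settings" idea, Django settings can be derived from an ini file at startup. The sketch below uses the standard library `configparser` on a flat fragment; it is not Hue's own loader, and a real hue.ini uses nested `[[...]]` sections that `configparser` does not understand. The fallback values are assumptions.

```python
import configparser

# A flat fragment in the style of hue.ini's [desktop] section
INI_TEXT = """
[desktop]
http_port = 8888
time_zone = America/Los_Angeles
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

# Values a settings.py module could expose, with fallbacks when a key is absent
TIME_ZONE = parser.get("desktop", "time_zone", fallback="UTC")
HTTP_PORT = parser.getint("desktop", "http_port", fallback=8000)
DEBUG = parser.getboolean("desktop", "django_debug_mode", fallback=False)
```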

GET HUE
CLOUDERA’S CDH

Stable and highly tested
releases perfectly
integrated with the
Hadoop ecosystem,
automagically configured
by Cloudera Manager.

TARBALL

Try in advance the latest
and greatest but you’ll
have to configure
everything on your own.

CLOUDERA’S DEMO VM

Get to play with Hue and
various Hadoop
components in 5
minutes. It’s a self
contained CDH
environment ready to
use.

HORTONWORKS*

In HDP there’s an old
forked version of Hue
2.3.

MAPR*

Newer version than HDP,
close to the original 2.5
minus apps like HBase,
Impala, Sqoop, Search.

HP CLOUD*

The newest addition,
ships Hue 3.0 through
the GreenButton
products.

BIGTOP

EMBEDDED/DEMO IN IND. COMPANIES

* YOUR MILEAGE MAY VARY.

LINKS
WEBSITE

http://gethue.com
GITHUB

https://github.com/cloudera/hue/
BLOG

http://blog.gethue.com
TWITTER

@gethue
USER GROUP

hue-user@

THANKS.
QUESTIONS?

gethue.com

How Hue integrates Hadoop with Django
