Polyglot metadata for Hadoop

Jim Dowling
Associate Prof @ KTH
Senior Researcher @ SICS
CEO @ Logical Clocks AB
www.hops.io
@hopshadoop
s9s Polyglot Persistence Meetup,
23rd August 2016 @ Spotify/Stockholm

Polyglot Persistence: the Right Tool for the Job

Minimize Impedance Mismatch
Data Access Paradigm
SQL
Static Object Oriented
Dynamic Object Oriented
Functional
Free-Text Search
Matrix Operations
Database Paradigm
Relational (Row-based)
Relational (Columnar)
Document-Based
Graph-Based
Key-Value
Object-Oriented /
Hierarchical / Network
?

In the Good Old Dayz….
[http://martinfowler.com/]

Led to Big Fights over Choice of Database

The Advent of Microservices

Should we give teams freedom to choose?
•Ericsson
- 1000s of build systems
•Google
- One Build System, Blaze

Monolithic vs Polyglot Persistence

Problems: Polyglot Persistence/Microservices
•What if the same data is updated by
different microservices?
-E.g., user data in subscriber systems
•Transactions (or agreement protocols)
across microservices are very hard
•Where possible, try to store such data in
a single database (single source of truth)
….

Problems: Polyglot Persistence/Microservices
•…but, if we minimize the number of
databases……
•…how do we handle different data access
patterns for the same data?
-OLTP SQL
-OLAP SQL
-Free-Text Search
-Etc

Multi-functional Databases
SAP Hana, MemSQL combine OLTP and OLAP in a single DB

If OLAP doesn’t mutate your data…..
[http://severalnines.com/]
OLTP OLAP
Replication protocols from the SSOT (single source of truth) to other databases
work for immutable data

If you have Big Data….
[Diagram: RDBMS (MySQL, Postgres, ..) with import and export to and from Hadoop via Sqoop]

Hadoop as a Polyglot Storage System

Hive as Polyglot Storage in Hadoop

Hive/Spark simplifies development
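Hive-on-Spark:
from pyspark.sql import HiveContext

# sc is an existing SparkContext; the table schema comes from the Hive metastore
sqlContext = HiveContext(sc)
f1_df = sqlContext.sql("""
    SELECT id, count(*) AS nb_entries
    FROM my_db.log
    WHERE ts = '20160515'
    GROUP BY id
""")

SparkSQL:
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

# Without Hive, the schema has to be declared by hand
sqlContext = SQLContext(sc)
# Assumption: each line of 'logfile' holds comma-separated ts,id,it fields
f0 = sc.textFile('logfile').map(lambda line: line.split(','))
fpFields = [
    StructField('ts', StringType(), True),
    StructField('id', StringType(), True),
    StructField('it', StringType(), True)
]
fpSchema = StructType(fpFields)
df_f0 = sqlContext.createDataFrame(f0, fpSchema)
df_f0.registerTempTable('log')
f1_df = sqlContext.sql("""
    SELECT log.id, count(*) AS nb_entries
    FROM log WHERE ts = '20160515'
    GROUP BY id
""")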

Hive as Metadata for HDFS Files
[Diagram: a SQL table named Table in Hive is backed by the file hive/warehouse/database/Table.hive in the Hadoop Distributed Filesystem (HDFS)]
Hive DOESN’T maintain the integrity of the metadata table and the HDFS file!!
Remove the Table.hive file and Hive doesn’t complain….
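The integrity gap can be illustrated with a minimal sketch. It uses plain Python containers as hypothetical stand-ins for the Hive metastore and the HDFS namespace; the table names and paths are made up for illustration and no real Hive or HDFS API is called.

```python
# Hypothetical stand-ins: a dict for the Hive metastore, a set for HDFS paths.
metastore = {
    "my_db.log": "/hive/warehouse/my_db/log.hive",
    "my_db.users": "/hive/warehouse/my_db/users.hive",
}
hdfs_files = set(metastore.values())


def dangling_tables(metastore, hdfs_files):
    """Return the tables whose backing file no longer exists."""
    return sorted(t for t, path in metastore.items() if path not in hdfs_files)


# Deleting the file directly in the file system does not touch the metastore,
# so the table entry silently becomes an orphan.
hdfs_files.discard("/hive/warehouse/my_db/log.hive")
orphans = dangling_tables(metastore, hdfs_files)
```

Because the two stores are updated independently, the only way to find the orphan here is to scan and compare after the fact.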

Let’s Dig Down and Understand Why…

Metadata in HDFS: Paths, Blocks, Replicas
[Diagram: HDFS metadata architecture with NameNode, Standby NameNode, Journal Nodes, Zookeeper, FSImage and Edit Log]
HDFS metadata cannot be easily extended (with a schema, for example)

Metadata Totem Poles in Hadoop
How do they ensure the consistency of the metadata and the data?

Hops Hadoop Data Integration Journey
•Principled and safe mechanisms for adding metadata
to HDFS and YARN
•Free-text search for files in HDFS
- Instead of Pig jobs to search HDFS, use Elasticsearch
•Integrate Kafka Metadata into Hops Hadoop
- Kafka uses Zookeeper for ACLs, Metadata, etc
- Hops Hadoop uses MySQL Cluster

HopsFS Architecture
NameNodes
NDB
Leader
HDFS Client
DataNodes
> 12 TB
[HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Niazi et al., arXiv 2016]

HopsYARN Architecture
ResourceMgrs
NDB
Scheduler
YARN Client
NodeManagers
Resource Trackers

Strongly Consistent Metadata for HDFS
Files
Directories
Schema
….
Single Database for HDFS Metadata
2-phase commit (transactions)
Metadata Integrity with Transactions and Foreign Keys.
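A minimal sketch of how transactions and foreign keys can keep file entries and their extended metadata consistent. It uses Python's built-in sqlite3 as a stand-in for the distributed database; the table and column names are hypothetical and are not the Hops schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE inodes (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL)")
conn.execute(
    "CREATE TABLE file_metadata ("
    " inode_id INTEGER NOT NULL REFERENCES inodes(id) ON DELETE CASCADE,"
    " body TEXT NOT NULL)"
)

with conn:  # one transaction: both rows commit or neither does
    cur = conn.execute("INSERT INTO inodes (path) VALUES (?)", ("/data/log",))
    conn.execute(
        "INSERT INTO file_metadata (inode_id, body) VALUES (?, ?)",
        (cur.lastrowid, '{"owner": "alice"}'),
    )

# Metadata that points at a non-existent file is rejected by the foreign key.
try:
    with conn:
        conn.execute("INSERT INTO file_metadata VALUES (?, ?)", (999, "{}"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Deleting the file removes its metadata in the same transaction (cascade).
with conn:
    conn.execute("DELETE FROM inodes WHERE path = ?", ("/data/log",))
remaining = conn.execute("SELECT count(*) FROM file_metadata").fetchone()[0]
```

With both tables in one database, orphaned metadata cannot be created or left behind, which is the property the Hive example earlier lacks.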

HDFS Metadata API
Schema-less API
attach(path, metadataJSON)
detach(hdfsPath)
Schema-based API
register(path, schemaJSON)
add(path, schema, metadataJSON)
remove(path, schema, metadataJSON)
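The calls listed above could be exercised roughly as follows. This is an in-memory sketch only: the `MetadataStore` class, the schema layout (a JSON object with `name` and `fields`) and the validation rule are assumptions made for illustration, not the Hops implementation.

```python
import json


class MetadataStore:
    """In-memory stand-in for the metadata API; not the Hops implementation."""

    def __init__(self):
        self.schemaless = {}  # path -> metadata dict
        self.schemas = {}     # (path, schema name) -> set of field names
        self.records = {}     # (path, schema name) -> list of metadata dicts

    # Schema-less API
    def attach(self, path, metadata_json):
        self.schemaless[path] = json.loads(metadata_json)

    def detach(self, path):
        self.schemaless.pop(path, None)

    # Schema-based API
    def register(self, path, schema_json):
        schema = json.loads(schema_json)
        key = (path, schema["name"])
        self.schemas[key] = set(schema["fields"])
        self.records.setdefault(key, [])

    def add(self, path, schema, metadata_json):
        record = json.loads(metadata_json)
        fields = self.schemas[(path, schema)]  # KeyError if not registered
        if set(record) != fields:
            raise ValueError("metadata does not match schema %r" % schema)
        self.records[(path, schema)].append(record)

    def remove(self, path, schema, metadata_json):
        self.records[(path, schema)].remove(json.loads(metadata_json))


store = MetadataStore()
store.attach("/data/log", '{"owner": "alice"}')
store.register("/data/log", '{"name": "sensor", "fields": ["unit", "site"]}')
store.add("/data/log", "sensor", '{"unit": "C", "site": "lulea"}')
try:
    store.add("/data/log", "sensor", '{"unit": "C"}')  # missing the "site" field
    accepted_bad = True
except ValueError:
    accepted_bad = False
```

The schema-less calls accept any JSON, while the schema-based calls reject metadata that does not match a registered schema.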

Hops Metadata services
Elasticsearch
Database
[HDFS/YARN]
Kafka
Zookeeper
Metadata API
Distributed Database is the Single Source-of-Truth for Metadata

Good Metadata: Elasticsearch
Files
Directories
Metadata
Search
Indexes
Database → Elasticsearch one-way replication
Eventual Consistency for Metadata.
Metadata Integrity maintained by
Asynchronous Replication and Metadata Immutability.
[ePipe Tutorial, BOSS Workshop, VLDB 2016]
immutable data
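A rough sketch of why one-way asynchronous replication is safe when the replicated entries are immutable. A list stands in for the database changelog and a dict for the search index; these names are hypothetical and no Elasticsearch or ePipe API is used.

```python
changelog = []     # append-only log written on the database side
search_index = {}  # stand-in for a search index: doc id -> document
applied_upto = 0   # replication cursor held by the replicator


def commit(doc_id, doc):
    """Database side: record an immutable metadata entry."""
    changelog.append((doc_id, dict(doc)))


def replicate():
    """Replicator: apply entries not yet seen; safe to re-run at any time."""
    global applied_upto
    for doc_id, doc in changelog[applied_upto:]:
        search_index[doc_id] = doc
    applied_upto = len(changelog)


commit("inode-1", {"path": "/data/log", "owner": "alice"})
commit("inode-2", {"path": "/data/users", "owner": "bob"})
lag_before = len(changelog) - len(search_index)  # index lags until replicate() runs
replicate()
replicate()  # replaying is harmless because entries never change
```

The index is stale until the replicator catches up (eventual consistency), but it can never disagree with the database about an entry it has already received.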

Not-so-Good Metadata: Kafka
Topics
Partitions
ACLs
Zookeeper/Kafka
Database
Eventual Consistency for Metadata.
Metadata integrity maintained by
custom recovery logic and polling.
Metadata API
polling
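The polling approach can be pictured as a periodic reconciliation step. The sketch below assumes the database is treated as the source of truth and that the topic names are already available as sets; the function and topic names are hypothetical, and a real poller would also need to call Kafka's administrative tooling to apply the repairs.

```python
def reconcile(db_topics, zk_topics):
    """Compare the database (source of truth) with what Zookeeper reports."""
    to_create = sorted(db_topics - zk_topics)  # in the database, missing in Kafka
    to_delete = sorted(zk_topics - db_topics)  # in Kafka, unknown to the database
    return to_create, to_delete


db_topics = {"clicks", "payments"}
zk_topics = {"clicks", "orphaned-topic"}
to_create, to_delete = reconcile(db_topics, zk_topics)
```

Between two polls the stores can disagree, which is why this is labelled not-so-good: integrity depends on custom recovery logic rather than on the database itself.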

www.hops.site
A 2 MW datacenter research and test environment
5 lab modules, planned for up to 3,000-4,000 servers and 2,000-3,000 square meters
[Slide by Prof. Tor Björn Minde, CEO SICS North Swedish ICT AB]

Summing Up
Multiple storage engines are here to stay
Picking the right tool for the right job is not easy
Do not be religious about your tool of choice!
Hops shows how you can combine multiple storage
engines to give an improved user experience for
Hadoop

The Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde,
Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Johan Svedlund Nordström,
Vasileios Giannokostas, Ermias Gebremeskel,
Antonios Kouzoupis, Misganu Dessalegn.
Alumni: Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca,
K “Sri” Srijeyanthan, Steffen Grohsschmiedt,
Alberto Lorente, Andre Moré, Ali Gholami,
Stig Viaene, Hooman Peiro, Evangelos Savvidis,
Jude D’Souza, Qi Qi, Gayana Chandrasekara,
Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos,
Peter Buechler, Pushparaj Motamari, Hamid Afzali,
Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.

Join us!
http://github.com/hopshadoop
www.hops.io
@hopshadoop
