Data Science with the Help of Metadata

Data Science with the Help of Metadata
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ SICS
CEO @ Logical Clocks AB
www.hops.io
@hopshadoop

Metadata for Source Code
•Metadata for Source Code
- Enables questions like: who, when, what, why?
•Metadata for Automation
- Enables testing, quality-control, deployment.
•Metadata for Collaboration
- Github projects, teams

Metadata for Datasets?
•Access Control
•Data provenance
•Auditing
•Development
- Schema for the dataset
- How can I load/download this dataset?
- Quality control
3

Metadata can simplify development
sqlContext = HiveContext(sc)
f1_df = sqlContext.sql(
"SELECT id, count(*) AS nb_entries
FROM my_db.log
WHERE ts = '20160515'
GROUP BY id"
)
sqlContext = SQLContext(sc)
f0 = sc.textFile('logfile')
fpFields = [
StructField(‘ts', StringType(), True),
StructField('id', StringType(), True),
StructField(‘it', StringType(), True)
]
fpSchema = StructType(fpFields)
df_f0 = sqlContext.createDataFrame(f0,
fpSchema)
df_f0.registerTempTable('log')
f1_df = sqlContext.sql(
"SELECT log.id, count(*) AS nb_entries
FROM log WHERE ts = '20160515‘
GROUP BY id“
)
4
SparkSQLHive-on-Spark

Hive is Metadata for HDFS files
5

Metadata for Files/Directories in HDFS
6
Add Schemas using
the Filesystem API
Add auditing using
the FSImage API
Add access control using
a Filesystem Plugin

Access Control in Hadoop
hdfs dfs -chmod -R 000 /apps/hive
7
[http://hortonworks.com/blog/best-practices-in-hdfs-authorization-with-apache-ranger]

Metadata Totem Poles in Hadoop
8How do you ensure the consistency of the metadata and the data?

Why are the Metadata Services Silo’ed?
9

HDFS v2
10
DataNodes
HDFS Client
Journal Nodes Zookeeper
NameNode Standby
NameNode
Max 200 GB
metadata

YARN
11
NodeManagers
YARN Client
Zookeeper
ResourceMgr Standby
ResourceMgr
Metadata on the
JVM Heap Again

Hops: Distributed Metadata for Hadoop
12

HopsFS Architecture
13
NameNodes
NDB
Leader
HDFS Client
DataNodes
> 12 TB
> 2.6 X
Throughput
[HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Niazi et Al, arXiv 2016]

HopsYARN Architecture
14
ResourceMgrs
NDB
Scheduler
YARN Client
NodeManagers
Resource Trackers
Leader Election for
Failed Scheduler
Up to 10K
Node Clusters

Experience Designing Metadata in Hops
15

Hops Metadata services
Elasticsearch
Database
[HDFS/YARN]
Kafka
Zookeeper
Metadata API
The Distributed Database is the Single Source-of-Truth for Metadata

Metadata for HDFS and YARN
17
Files
Directories
Containers
Provenance
Security
Quotas
Projects
Datasets
Metadata + Data in the same Database
2-phase commit (transactions)
Strong Consistency for Metadata.
Metadata Integrity maintained using 2PC and Foreign Keys.

Metadata in Elasticsearch
18
Files
Directories
Metadata
Search
Indexes
DatabaseElasticsearch one-way replication
Eventual Consistency for Metadata.
Metadata Integrity maintained by Asynchronous Replication.
[ePipe Tutorial, BOSS Workshop, VLDB 2016]

Metadata for Kafka
19
Topics
Partitions
ACLs
Zookeeper/KafkaDatabase
Eventual Consistency for Metadata.
Metadata integrity maintained by custom recovery logic and polling.
Metadata API
polling

Case Study: Self-Service Multi-Tenant Projects
20
www.hops.io
@hopshadoop

Problem: Sensitive Data needs its own Cluster
21
NSA DataSet
User DataSet
Alice can copy/cross-link between data sets
Alice has only one Kerberos Identity.
Neither attribute-based access control nor dynamic roles supported in Hadoop.
Alice

Solution: Project-Specific UserIDs
22
Project NSA
Project Users
Member of
NSA__Alice
Users__Alice
Member of
HDFS enforces
access control
How can we share DataSets between Projects?

Sharing DataSets between Projects
23
Project NSA
Project Users
Member of
DataSetowns
Add members of Project
NSA to the DataSet group
NSA__Alice
Users__Alice
Member of

HopsWorks (WebApp) Enforces Dynamic Roles
24
Alice@gmail.com
NSA__Alice
Authenticate
Users__Alice
HopsWorks
HopsFS
HopsYARN
Projects
Secure
Impersonation
Kafka
X.509
Certificates

X.509 Certificate Per Project-Specific User
25
Alice@gmail.com
Authenticate
Add/Del
Users
Distributed
Database
Insert/Remove CertsProject
Mgr
Root
CA
Services
Hadoop
Spark
Kafka
etc
Cert Signing
Requests

Project
•A project is a collection of
- Members
- HDFS DataSets
- Kafka Topics
- Notebooks, Jobs
•A project has an owner
•A project has quotas
26
project
dataset 1
dataset N
Topic 1
Topic N
Kafka
HDFS

Project Roles
Data Owner Privileges
- Import/Export data
- Manage Membership
- Share DataSets, Topics
Data Scientist Privileges
- Write and Run code
27
We delegate administration of privileges to users

Elastic Hadoop
Each Project has a:
• YARN CPU Quota
• HDFS Storage Quota
Uber-Style Pricing to
incentivize cluster usage
28

Sharing DataSets/Topics between Projects
29
The same as Sharing Folders in Dropbox

Added Multi-Tenancy to Zeppelin

www.hops.site
31
A 2 MW datacenter research and test environment
5 lab modules, planned up to 3-4000 servers, 2-3000 square meters
[Slide by Prof. Tor Björn Minde, CEO SICS North Swedish ICT AB]

Status and Upcoming
•Automated installation support using Vagrant/Chef
or Karamel/Chef
•First official release of Hopsworks coming soon
•Globally shared datasets with peer-to-peer
technology, backed by our data center.
•Support for Apache Beam

Summing Up
Metadata services have the potential to make your
life easier as a Data Scientist
Most Hadoop Metadata services are proprietary and
require an administrator-in-the-loop
Hops provides an open, tinker-friendly platform for
building consistent metadata
Hopsworks shows how you can leverage metadata to
build a self-service project-based model for
Hadoop/Spark/Flink applications
34

The Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde,
Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Johan Svedlund Nordström,
Vasileios Giannokostas, Ermias Gebremeskel,
Antonios Kouzoupis, Misganu Dessalegn, Rizvi Hasan,
Paul Mälzer, Bram Leenders, Juan Roca.
Alumni: K. “Sri” Srijeyanthan, Steffen Grohsschmiedt,
Alberto Lorente, Andre Moré, Ali Gholami,
Stig Viaene, Hooman Peiro, Evangelos Savvidis,
Jude D’Souza, Qi Qi, Gayana Chandrasekara,
Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos,
Peter Buechler, Pushparaj Motamari, Hamid Afzali,
Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.

Join us!
http://github.com/hopshadoop
www.hops.io
@hopshadoop

Data Science with the Help of Metadata

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (6)

Similar to Data Science with the Help of Metadata

Similar to Data Science with the Help of Metadata (20)

More from Jim Dowling

More from Jim Dowling (20)

Recently uploaded

Recently uploaded (20)

Data Science with the Help of Metadata