Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin

Flink Forward 2015

Published in: Technology

  1. 1. Interactive Flink analytics with HopsWorks and Zeppelin. Jim Dowling, Ermias Gebremeskel. www.hops.io @hopshadoop
  2. 2. Marketing 101: Celebrity Endorsements. "Hi! I’m Leslie Lamport* and even though you’re not using Paxos, I approve this product." *Turing Award Winner 2014, Father of Distributed Systems
  3. 3. Talk Overview • Multi-tenancy in Hadoop • Multi-tenancy in HopsWorks • Free-Text Search of Hadoop Metadata in HopsWorks • Zeppelin and Flink in HopsWorks
  4. 4. Goal: Multi-Tenancy and Data Sharing. No Unauthorized Copying/Cross-Linking of Data. Diagram labels: Project NSA, Project X, DataSet, owns, authorize access
  5. 5. Access Control in Relational Databases # How do we provide multi-tenancy for users alice and bob using two databases db1 and db2? GRANT ALL PRIVILEGES ON db1.* TO 'alice'@'%'; GRANT ALL PRIVILEGES ON db2.* TO 'bob'@'%'; # More fine-grained privileges GRANT SELECT ON db2.sensitiveTable TO 'alice'@'192.168.1.2'; What happens to the privileges if I call "DROP TABLE db2.sensitiveTable"?
  6. 6. Access Control in Hadoop: Apache Sentry. How do you ensure the consistency of the policies and the data? [Mujumdar’15]
  7. 7. Policy Editor for Sentry
  8. 8. Performance of Policy Enforcement Points (PEP) *https://docs.wso2.com/display/IS500/XACML+Performance+in+the+Identity+Server
  9. 9. PEPs + Hadoop = Horse-Drawn Sportscar. Policy Enforcement Engines ≈ O(2,000) ops/sec. HopsFS Distributed Filesystem ≈ O(100,000) ops/sec.
  10. 10. HopsWorks
  11. 11. Users, DataSets, and Projects. In-Place Data Sharing - not Copying! Diagram labels: DataSet1, DataSet2, DataSet3, Project 1, Project 2, Project 3
  12. 12. User • Authentication Provider - JDBC Realm - 2-Factor Authentication - LDAP
  13. 13. Project • Members - Roles: Owner, Data Scientist • DataSets - Home project - Can be shared
  14. 14. Project Roles • Owner Privileges - Import/Export data - Manage Membership - Share DataSets • Data Scientist Privileges - Write code - Run code - Request access to DataSets. We delegate administration of privileges to users.
  15. 15. Sharing DataSets between Projects. The same as Sharing Folders in Dropbox.
  16. 16. Delegate Access Control to HDFS • HDFS enforces access control • Convention for directories • Hadoop and HopsWorks use the same Users and Groups in a common DB • UserId per Project • GroupId per Project and DataSet. With Hadoop metadata in a DB, we guarantee policy integrity with Foreign Keys.
  17. 17. Engine – HopsFS, HopsYARN
  18. 18. HopsFS. Diagram labels: Stateless NameNodes, NDB, Leader, DataNodes, HopsWorks J2EE Server, Metadata & policies
  19. 19. HopsYARN. Diagram labels: ResourceMgrs, NDB, Scheduler, NodeManagers, Resource Trackers, HopsWorks J2EE Server, Metadata & policies
  20. 20. Data Abstraction Layer (DAL). Diagram labels: NameNode (Apache v2), ResourceMgr (Apache v2), DAL API (Apache v2), NDB-DAL-Impl (GPL v2), Other Impl (Other License), hops-2.4.0.jar, dal-ndb-2.4.0-7.4.7.jar
  21. 21. Hops Performance
  22. 22. HopsFS Metadata Scaleout. Assuming 256 MB block size and 100 GB JVM heap for Apache Hadoop.
  23. 23. HopsFS Throughput (Real Workload). Experiments performed on AWS EC2 with enhanced networking and c3.8xlarge instances.
  24. 24. What else can we do with metadata in a DB?
  25. 25. How ACME Inc. handles Free-Text Search. HDFS. In Theory: Unified Search and Update API. In Practice: Inconsistent Metadata.
  26. 26. Global Search: Projects and DataSets
  27. 27. Project Search: Files, Directories
  28. 28. Design your own Extended Metadata
  29. 29. MetaData Entry
  30. 30. Free-Text Search with Consistent Metadata. The Distributed Database is the Single Source of Truth. Foreign keys ensure the integrity of Metadata. Diagram labels: MetaData Designer, MetaData Entry, Distributed Database, ElasticSearch, Free-Text Search
  31. 31. Flink and Zeppelin in HopsWorks
  32. 32. Batch Job Analytics
  33. 33. Interactive Analytics: Flink on Zeppelin
  34. 34. Other Features • Audit Logs • Erasure Coding Replication • Online upgrade of Hops (and NDB) • Automated Installation with Karamel • Tinker friendly – easy to extend metadata!
  35. 35. Conclusions • Hops is a next-generation distribution of Hadoop. • HopsWorks is a frontend to Hops that supports true multi-tenancy, free-text search, interactive analytics with Zeppelin/Flink/Spark, and batch jobs. • Looking for contributors/committers - Pick-me-up on GitHub. www.hops.io
  36. 36. The Team. Academics: Jim Dowling, Seif Haridi. PostDocs: Gautier Berthou. PhDs: Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh. MSc Students: K. Srijeyanthan “Sri”, Evangelos Savvidis, Seçkin Savaşçı, Ermias Gebremeskel. Alumni: Steffen Grohsschmiedt, Theofilos Kakantousis, Stig Viaene, Andre Moré, Qi Qi, Alberto Lorente, Hooman Peiro, Jude D’Souza, Nikolaos Stanogias, Daniel Bali, Ioannis Kirkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
  37. 37. Hops [Hadoop For Humans] www.hops.io @hopshadoop
  38. 38. HDFS v2 Architecture. Diagram labels: HDFS Client, NameNode, Standby NameNode, DataNodes, Journal Nodes, Zookeeper, Snapshot Node. Active-Standby Replication of NN Log. Agreement on the Active NameNode. Faster Recovery - Cut the NN Log. Doesn’t Scale Out.
  39. 39. YARN Architecture. Diagram labels: YARN Client, ResourceMgr, Standby ResourceMgr, NodeManagers, Zookeeper. 1. Master-Slave Replication of RM State 2. Agreement on the Active ResourceMgr
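
Slides 16 and 30 claim that keeping Hadoop metadata and policies in one database lets foreign keys guarantee policy integrity. The sketch below illustrates that idea only; it is not the HopsWorks schema. It uses SQLite as a stand-in for NDB, and the table names `dataset` and `dataset_grant`, their columns, and the function names are hypothetical. It shows two effects: a grant cannot reference a DataSet that does not exist, and deleting a DataSet removes its grants in the same operation, which is one possible answer to the question on slide 5 about dropped tables.

```python
import sqlite3


def build_schema(conn):
    """Create a toy metadata schema (hypothetical names, not the HopsWorks schema)."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE dataset (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        -- A grant row is the 'policy'; it cannot outlive the dataset it refers to.
        CREATE TABLE dataset_grant (
            dataset_id INTEGER NOT NULL
                REFERENCES dataset(id) ON DELETE CASCADE,
            project    TEXT NOT NULL,
            PRIMARY KEY (dataset_id, project)
        );
        """
    )


def demo():
    """Return (dangling_grant_rejected, grants_left_after_delete)."""
    conn = sqlite3.connect(":memory:")
    build_schema(conn)

    conn.execute("INSERT INTO dataset (id, name) VALUES (1, 'sensitiveTable')")
    conn.execute("INSERT INTO dataset_grant VALUES (1, 'Project X')")

    # A grant that points at a non-existent dataset violates the foreign key.
    try:
        conn.execute("INSERT INTO dataset_grant VALUES (99, 'Project X')")
        dangling_rejected = False
    except sqlite3.IntegrityError:
        dangling_rejected = True

    # Deleting the dataset cascades to its grants, so no stale policy remains.
    conn.execute("DELETE FROM dataset WHERE id = 1")
    remaining = conn.execute("SELECT COUNT(*) FROM dataset_grant").fetchone()[0]

    conn.close()
    return dangling_rejected, remaining


if __name__ == "__main__":
    print(demo())
```

Contrast this with a setup where policies live in a separate policy store: nothing ties the lifetime of a policy to the lifetime of the data it protects, so the two can drift apart.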
