
Strata Feb 2013


Slides from Strata talk "Coordinating the Many Tools of Big Data"



  1. Coordinating the Many Tools of Big Data
     Strata 2013
     Alan F. Gates, @alanfgates
     © Hortonworks 2013
  2. Big Data = Terabytes, Petabytes, …
     Image credit: Gizmodo
  3. But It Is Also Complex Algorithms
     • An example from a talk by Jimmy Lin at Hadoop Summit 2012 on calculations Twitter is doing via UDFs in Pig. This equation uses stochastic gradient descent to do machine learning with their data:
       w(t+1) = w(t) − γ(t) ∇ℓ(f(x; w(t)), y)
     where ℓ is the loss function and γ(t) the step size at iteration t.
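The update rule on the slide can be sketched in a few lines of Python. This is a generic illustration, not Twitter's actual Pig UDF: it assumes a linear model f(x; w) = w·x and squared-error loss, both chosen here only for concreteness.

```python
import random

def sgd_step(w, x, y, lr):
    """One stochastic gradient descent step for a linear model f(x; w) = w.x
    with squared-error loss l(f, y) = (f - y)^2.  Implements the slide's
    update w(t+1) = w(t) - gamma(t) * grad l(f(x; w(t)), y)."""
    f = sum(wi * xi for wi, xi in zip(w, x))       # model prediction f(x; w)
    grad = [2.0 * (f - y) * xi for xi in x]        # gradient of the loss w.r.t. w
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# Recover a known target weight vector from noise-free samples.
random.seed(0)
target = [2.0, -1.0]
w = [0.0, 0.0]
for t in range(2000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    y = sum(ti * xi for ti, xi in zip(target, x))
    w = sgd_step(w, x, y, lr=0.1)
print([round(wi, 2) for wi in w])  # converges to the target, [2.0, -1.0]
```

In a Pig UDF the same step would run once per input tuple, which is what makes the stochastic (one-example-at-a-time) form a natural fit for a data-flow engine.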
  4. And New Tools
     • Apache Hadoop brings with it a large selection of tools and paradigms
        – Apache HBase, Apache Cassandra – distributed, high-volume reads and writes of individual data records
        – Apache Hive – SQL
        – Apache Pig, Cascading – data flow programming for ETL, data modeling, and exploration
        – Apache Giraph – graph processing
        – MapReduce – batch processing
        – Storm, S4 – stream processing
        – Plus lots of commercial offerings
  5. Pre-Cloud: One Tool per Machine
     • Databases presented SQL or SQL-like paradigms for operating on data
     • Other tools came in separate packages (e.g. R) or on separate platforms (e.g. SAS)
     [Diagram: separate systems – OLTP, OLAP, Data Warehouse, Data Mart, Cube/MOLAP, Statistical Analysis]
  6. Cloud: Many Tools, One Platform
     • Users no longer want to be concerned with what platform their data is in – just apply the tool to it
     • SQL no longer the only or primary data access tool
     [Diagram: the same systems – OLTP, OLAP, Data Warehouse, Data Mart, Cube/MOLAP, Statistical Analysis – sharing one platform]
  7. Upside – Pick the Right Tool for the Job
  8. Downside – Tools Don’t Play Well Together
     • Hard for users to share data between tools
        – Different storage formats
        – Different data models
        – Different user-defined function interfaces
  9. Downside – Wasted Developer Time
     • Wastes developer time, since each tool supplies redundant functionality
     [Diagram: Hive and Pig each implement their own Parser, Optimizer, Physical Planner, and Executor on top of shared Metadata]
 10. Downside – Wasted Developer Time
     [Same diagram, with the overlap between the Hive and Pig stacks highlighted]
 11. Conclusion: We Need Services
     • We need to find a way to share services where we can
     • Gives users the same experience across tools
     • Allows developers to share effort when it makes sense
 12. Hadoop = Distributed Data Operating System
     Service                      Hadoop component
     Table management             Hive
     Access to metadata           HCatalog
     User authentication          Knox
     Resource management          YARN
     Notification                 HCatalog
     REST / connectors            WebHCat, WebHDFS, Hive, HBase, Oozie
     Relational data processing   Tez
     (Legend: exists / pieces exist in this component / new project)
 13. [Same table, repeated as a build slide]
 14. HCatalog – Table Management
     • Opens up Hive’s tables to other tools inside and outside Hadoop
     • Presents tools with a table paradigm that abstracts away storage details
     • Provides a shared data model
     • Provides a shared code path for data and metadata access
 15. [Same bullets; diagram adds the Hive Metastore]
 16. [Diagram adds Pig reading via HCatLoader and MapReduce reading via HCatInputFormat, both backed by the Metastore]
 17. [Diagram adds external systems reaching the Metastore over REST via WebHCat]
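The REST path in slide 17 (WebHCat, formerly Templeton) is what lets external systems reach table metadata over plain HTTP. A minimal sketch of building such a request URL, assuming WebHCat's default port 50111 and its /templeton/v1 path; the host, database, table, and user names here are placeholders:

```python
def webhcat_table_url(host, db, table, user, port=50111):
    """Build the WebHCat (Templeton) REST URL that describes a Hive table.
    The /templeton/v1 path and port 50111 follow WebHCat's documented
    defaults; host and user values are placeholders for illustration."""
    return (f"http://{host}:{port}/templeton/v1/ddl/database/"
            f"{db}/table/{table}?user.name={user}")

# A hypothetical request for a table named 'clicks' in the default database:
url = webhcat_table_url("hcat-server.example.com", "default", "clicks", "alan")
print(url)
```

An external tool would issue an HTTP GET against this URL and receive the table's schema as JSON, without linking against any Hadoop libraries.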
 18. Tez – Moving Beyond MapReduce
     • Low-level data-processing execution engine
     • Use it as the base of MapReduce, Hive, Pig, Cascading, etc.
     • Enables pipelining of jobs
     • Removes task and job launch times
     • Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
     • Does not write intermediate output to HDFS
        – Much lighter disk and network usage
     • Built on YARN
 19. Pig/Hive-MR versus Pig/Hive-Tez
       SELECT a.state, COUNT(*), AVG(c.price)
       FROM a
       JOIN b ON (a.id = b.id)
       JOIN c ON (a.itemId = c.itemId)
       GROUP BY a.state
     [Diagram, Pig/Hive-MR: Job 1 → I/O synchronization barrier → Job 2 → I/O synchronization barrier → Job 3]
 20. [Same query; diagram adds Pig/Hive-Tez executing the whole plan as a single job]
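To make the three-job shape concrete, here is what the query computes, sketched in pure Python over tiny made-up tables; each commented stage corresponds to one MapReduce job in the Pig/Hive-MR plan, with the intermediate lists standing in for the HDFS writes at each barrier:

```python
from collections import defaultdict

# Toy rows for tables a, b, c (hypothetical data, for illustration only).
a = [{"id": 1, "itemId": 10, "state": "CA"},
     {"id": 2, "itemId": 11, "state": "CA"},
     {"id": 3, "itemId": 10, "state": "OR"}]
b = [{"id": 1}, {"id": 2}, {"id": 3}]
c = [{"itemId": 10, "price": 4.0}, {"itemId": 11, "price": 6.0}]

# Job 1: a JOIN b ON (a.id = b.id)
b_ids = {row["id"] for row in b}
ab = [row for row in a if row["id"] in b_ids]

# Job 2: ... JOIN c ON (a.itemId = c.itemId)
prices = {row["itemId"]: row["price"] for row in c}
abc = [dict(row, price=prices[row["itemId"]])
       for row in ab if row["itemId"] in prices]

# Job 3: GROUP BY a.state, computing COUNT(*) and AVG(c.price)
groups = defaultdict(list)
for row in abc:
    groups[row["state"]].append(row["price"])
result = {state: (len(ps), sum(ps) / len(ps)) for state, ps in groups.items()}
print(result)  # → {'CA': (2, 5.0), 'OR': (1, 4.0)}
```

Under MapReduce, `ab` and `abc` are materialized to HDFS at each barrier; Tez streams them between stages of a single job instead.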
 21. Fast Query: Beyond Batch with YARN
     • Tez generalizes Map-Reduce – simplified execution plans process data more efficiently
     • Always-on Tez service – low-latency processing for all Hadoop data processing
 22. Knox – Single Sign On
 23. Today’s Access Options
     • Direct access
        – Access services via REST (WebHDFS, WebHCat)
        – Need knowledge of and access to the whole cluster
        – Security handled by each component in the cluster
        – Kerberos details exposed to users
       [Diagram: User → {REST} → Hadoop cluster]
     • Gateway / portal nodes
        – Dedicated nodes behind the firewall
        – Users SSH to the node to access Hadoop services
       [Diagram: User → SSH → gateway node → Hadoop cluster]
 24. Knox Design Goals
     • Operators can firewall the cluster without end-user access to a “gateway node”
     • Users see one cluster end-point that aggregates capabilities for data access, metadata, and job control
     • Provide perimeter security to make Hadoop security setup easier
     • Enable integration with enterprise and cloud identity management environments
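A sketch of the "one cluster end-point" goal: rewriting a direct per-node REST URL into a single gateway URL. The `/gateway/{topology}` prefix follows Knox's conventional URL layout and 8443 its usual TLS port, but the host, port, and topology names below are placeholders, not a definitive deployment:

```python
def gateway_url(gateway_host, topology, service_path, port=8443):
    """Map a service-specific REST path onto a single Knox gateway endpoint.
    The /gateway/{topology} prefix follows Knox's conventional URL scheme;
    the host and topology names are placeholders for illustration."""
    return f"https://{gateway_host}:{port}/gateway/{topology}/{service_path}"

# Direct per-node WebHDFS call vs. the same call through the gateway:
direct = "http://namenode.internal:50070/webhdfs/v1/data/file?op=OPEN"
proxied = gateway_url("knox.example.com", "prod", "webhdfs/v1/data/file?op=OPEN")
print(proxied)
```

The client only ever sees the gateway host, which is what lets operators keep every other cluster node behind the firewall.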
 25. Perimeter Verification & Authentication
     • Verification
        – Verify identity token
        – SAML, propagation of identity
     • Authentication
        – Establish identity at the gateway
        – Authenticate with LDAP / AD
     [Diagram: Client → {REST} → Knox Gateway → Hadoop cluster (WebHDFS NN, DNs, JT, WebHCat, Hive); the gateway verifies and authenticates against an ID provider / user store (KDC, AD, LDAP)]
 26. Thank You
