Web Services in
Hadoop
Hadoop Summit 2012
Nicholas Sze and Alan F. Gates
@szetszwo, @alanfgates




REST-ful API Front-door for Hadoop
• Opens the door to languages other than Java
• Thin clients via web services vs. fat clients in a gateway
• Insulation from interface changes from release to release


                               HCatalog web interfaces

                            MapReduce     Pig      Hive

                                       HCatalog

                            HDFS      HBase      External Store

      © 2012 Hortonworks                                     Page 2
Not Covered in this Talk
•  HttpFS (formerly known as Hoop) – same API as WebHDFS, but proxied
•  Stargate – REST API for HBase




HDFS Clients
• DFSClient: the native client
  – High performance (uses Hadoop RPC)
  – Java binding only


• libhdfs: a C client interface
  – Uses JNI => large overhead
  – Also Java-bound (requires a Hadoop installation)




     Architecting the Future of Big Data     Page 4
HFTP
• Designed for cross-version copying (DistCp)
  – High performance (using HTTP)
  – Read-only
  – The HTTP API is proprietary
  – Clients must use HftpFileSystem (hftp://)


• WebHDFS is a rewrite of HFTP



Design Goals

• Support a public HTTP API

• Support Read and Write

• High Performance

• Cross-version

• Security



WebHDFS features
• HTTP REST API
  – Defines a public API
  – Permits non-Java client implementations
  – Supports common tools like curl/wget


• Wire Compatibility
  – The REST API will be maintained for wire compatibility
  – WebHDFS clients can talk to different Hadoop versions




WebHDFS features (2)

• A Complete HDFS Interface
  – Supports all user operations
     – reading files
     – writing to files
     – mkdir, chmod, chown, mv, rm, …


• High Performance
  – Uses HTTP redirection to provide data locality
  – File reads/writes are redirected to the corresponding datanodes



WebHDFS features (3)

• Secure Authentication
  – Same as Hadoop authentication: Kerberos (SPNEGO) and Hadoop
    delegation tokens
  – Supports proxy users


• An HDFS Built-in Component
  – WebHDFS is a first-class, built-in component of HDFS
  – Runs inside namenodes and datanodes

• Apache Open Source
  – Available in Apache Hadoop 1.0 and above

WebHDFS URI & URL
• FileSystem scheme:
          webhdfs://

• FileSystem URI:
          webhdfs://<HOST>:<HTTP_PORT>/<PATH>

• HTTP URL:
  http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..

  – Path prefix:    /webhdfs/v1
  – Query:          ?op=..



URI/URL Examples
•  Suppose we have the following file
     hdfs://namenode:8020/user/szetszwo/w.txt

•  WebHDFS FileSystem URI
    webhdfs://namenode:50070/user/szetszwo/w.txt

•  WebHDFS HTTP URL
   http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=..

•  WebHDFS HTTP URL to open the file
   http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
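As a sketch, the URI-to-URL mapping above can be captured in a few lines of Python. The host name and port 50070 (the default namenode HTTP port in Hadoop 1.x) follow the examples; the helper name is ours, not part of any Hadoop API:

```python
# Build a WebHDFS URL of the form
# http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=...
def webhdfs_url(host, http_port, path, op, **params):
    query = "&".join([f"op={op}"] + [f"{k}={v}" for k, v in params.items()])
    return f"http://{host}:{http_port}/webhdfs/v1{path}?{query}"

url = webhdfs_url("namenode", 50070, "/user/szetszwo/w.txt", "OPEN")
print(url)  # http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
```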

Example: curl
•  Use curl to open a file

$ curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN"

HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Content-Length: 0
Server: Jetty(6.1.26)
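The 307 above is the data-locality mechanism in action: the namenode answers with a Location header pointing at a datanode, and the client (here, curl -L) simply follows it. A small sketch of pulling the pieces out of that header, with the values copied from the response above:

```python
from urllib.parse import urlsplit, parse_qs

# Location header from the 307 response above.
location = ("http://192.168.5.2:50075/webhdfs/v1/user/"
            "szetszwo/w.txt?op=OPEN&offset=0")

parts = urlsplit(location)
print(parts.netloc)           # 192.168.5.2:50075 -- a datanode, not the namenode
print(parse_qs(parts.query))  # {'op': ['OPEN'], 'offset': ['0']}
```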




Example: curl (2)

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 21
Server: Jetty(6.1.26)

Hello, WebHDFS user!




Example: wget
•  Use wget to open the same file

$ wget "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN" -O w.txt

Resolving ...
Connecting to ... connected.
HTTP request sent, awaiting response...
307 TEMPORARY_REDIRECT
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0 [following]




Example: wget (2)

--2012-06-13 01:42:10-- http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Connecting to 192.168.5.2:50075... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21 [application/octet-stream]
Saving to: `w.txt'

100%[=================>] 21                --.-K/s     in 0s

2012-06-13 01:42:10 (3.34 MB/s) - `w.txt' saved [21/21]




Example: Firefox

[Screenshot: opening the same WebHDFS URL in Firefox]
HCatalog REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop
•  Uses JSON to describe metadata objects
•  Versioned, because we assume we will have to update it:
   http://hadoop.acme.com/templeton/v1/…
•  Runs in a Jetty server
•  Supports security
     –  Authentication done via Kerberos using SPNEGO
•  Included in HDP; runs on the Thrift metastore server machine
•  Not yet checked in, but you can find the code on Apache’s JIRA
   HCATALOG-182




HCatalog REST API
                          Get a list of all tables in the default database:




           GET
           http://…/v1/ddl/database/default/table
                                                                               Hadoop/HCatalog
           {
               "tables": ["counted", "processed"],
               "database": "default"
           }



 Indicate user with URL parameter:
 http://…/v1/ddl/database/default/table?user.name=gates
 Actions authorized as indicated user
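Since the response bodies are plain JSON, any client can consume them directly. A minimal sketch parsing the table-list response shown above, with the body copied from the slide:

```python
import json

# Response body from GET .../v1/ddl/database/default/table
body = '{"tables": ["counted", "processed"], "database": "default"}'

resp = json.loads(body)
print(resp["database"])  # default
print(resp["tables"])    # ['counted', 'processed']
```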

HCatalog REST API
                        Create new table “rawevents”

         PUT
         {"columns": [{ "name": "url", "type": "string" },
                      { "name": "user", "type": "string"}],
          "partitionedBy": [{ "name": "ds", "type": "string" }]}

         http://…/v1/ddl/database/default/table/rawevents

                                                             Hadoop/HCatalog
          {
              "table": "rawevents",
              "database": "default"
          }
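The PUT body is ordinary JSON as well. A sketch of assembling the "rawevents" definition shown above before sending it as the body of the HTTP PUT; the dict layout mirrors the request on the slide, nothing more:

```python
import json

# Table definition for PUT .../v1/ddl/database/default/table/rawevents
table_def = {
    "columns": [{"name": "url", "type": "string"},
                {"name": "user", "type": "string"}],
    "partitionedBy": [{"name": "ds", "type": "string"}],
}
payload = json.dumps(table_def)  # serialized request body
```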




HCatalog REST API
                        Describe table “rawevents”




         GET
         http://…/v1/ddl/database/default/table/rawevents
                                                           Hadoop/HCatalog
         {
              "columns": [{"name": "url","type": "string"},
                          {"name": "user","type": "string"}],
              "database": "default",
              "table": "rawevents"
         }




Job Management
•  Includes APIs to submit and monitor jobs
•  Any files needed for a job are first uploaded to HDFS via WebHDFS
   –  Pig and Hive scripts
   –  Jars, Python scripts, or Ruby scripts for UDFs
   –  Pig macros
•  Results from a job are stored in HDFS and can be retrieved via WebHDFS
•  The user is responsible for cleaning up output in HDFS
•  Job state information is stored in ZooKeeper or HDFS




Job Submission
•  Can submit MapReduce, Pig, and Hive jobs
•  POST parameters include
   –  script to run or HDFS file containing script/jar to run
   –  the username to execute the job as
   –  optionally, an HDFS directory to write results to (defaults to the user’s home directory)
   –  optionally, a URL to invoke GET on when the job is done


              POST
              http://hadoop.acme.com/templeton/v1/pig
                                                                             Hadoop/HCatalog
              {"id": "job_201111111311_0012",…}
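The POST body is regular form data. A sketch of building it; the parameter names (execute, statusdir) follow common Templeton conventions but are illustrative here, since the slide does not list them:

```python
from urllib.parse import urlencode

# Form parameters for POST http://.../templeton/v1/pig
params = {
    "user.name": "gates",                      # user to run the job as
    "execute": "A = load 'counted'; dump A;",  # inline Pig script
    "statusdir": "/user/gates/output",         # HDFS dir for results (optional)
}
body = urlencode(params)  # url-encoded request body
```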




Find all Your Jobs
•  GET on queue returns all jobs belonging to the submitting user
•  Pig, Hive, and MapReduce jobs will be returned




              GET
              http://…/templeton/v1/queue?user.name=gates
                                                                     Hadoop/HCatalog
              {"job_201111111311_0008",
               "job_201111111311_0012"}




Get Status of a Job
•  Doing a GET on a jobid returns information about that job
•  Can be used to poll to see if the job is finished
•  Used after the job finishes to get job information
•  Doing a DELETE on a jobid kills the job




              GET
              http://…/templeton/v1/queue/job_201111111311_0012
                                                                       Hadoop/HCatalog
              {…, "percentComplete": "100% complete",
                  "exitValue": 0,…
                  "completed": "done"
               }
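A polling client only needs to inspect a couple of fields in that response. A sketch using the (abridged) body above:

```python
import json

# Abridged response body from GET .../templeton/v1/queue/<jobid>
status = json.loads('{"percentComplete": "100% complete", '
                    '"exitValue": 0, "completed": "done"}')

finished = status.get("completed") == "done"
succeeded = finished and status.get("exitValue") == 0
print(finished, succeeded)  # True True
```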




Future
•  Job management
   –  Job management APIs don’t belong in HCatalog
   –  Only there by historical accident
   –  Need to move them out to MapReduce framework
•  Authentication needs more options than Kerberos
•  Integration with Oozie
•  Need a directory service
   –  Users should not need to connect to different servers for HDFS, HBase, HCatalog,
      Oozie, and job submission





More Related Content

What's hot

Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDB
MongoDB
 
Dancing with the elephant h base1_final
Dancing with the elephant   h base1_finalDancing with the elephant   h base1_final
Dancing with the elephant h base1_final
asterix_smartplatf
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
Hyunsik Choi
 

What's hot (20)

HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011
 
Yahoo! Hack Europe Workshop
Yahoo! Hack Europe WorkshopYahoo! Hack Europe Workshop
Yahoo! Hack Europe Workshop
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
 
HBaseCon 2015: Analyzing HBase Data with Apache Hive
HBaseCon 2015: Analyzing HBase Data with Apache  HiveHBaseCon 2015: Analyzing HBase Data with Apache  Hive
HBaseCon 2015: Analyzing HBase Data with Apache Hive
 
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
A Survey of HBase Application Archetypes
A Survey of HBase Application ArchetypesA Survey of HBase Application Archetypes
A Survey of HBase Application Archetypes
 
2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog2013 feb 20_thug_h_catalog
2013 feb 20_thug_h_catalog
 
Introduction to Apache Drill
Introduction to Apache DrillIntroduction to Apache Drill
Introduction to Apache Drill
 
Simplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in HadoopSimplified Data Management And Process Scheduling in Hadoop
Simplified Data Management And Process Scheduling in Hadoop
 
Moving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDBMoving from C#/.NET to Hadoop/MongoDB
Moving from C#/.NET to Hadoop/MongoDB
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
August 2014 HUG : Hive 13 Security
August 2014 HUG : Hive 13 SecurityAugust 2014 HUG : Hive 13 Security
August 2014 HUG : Hive 13 Security
 
Dancing with the elephant h base1_final
Dancing with the elephant   h base1_finalDancing with the elephant   h base1_final
Dancing with the elephant h base1_final
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
 
Mar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBaseMar 2012 HUG: Hive with HBase
Mar 2012 HUG: Hive with HBase
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
 
Apache hive
Apache hiveApache hive
Apache hive
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 

Viewers also liked

Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
Adam Kawa
 

Viewers also liked (19)

Big data and its impact on SOA
Big data and its impact on SOABig data and its impact on SOA
Big data and its impact on SOA
 
Flume intro-100715
Flume intro-100715Flume intro-100715
Flume intro-100715
 
Inside Flume
Inside FlumeInside Flume
Inside Flume
 
Apache Hadoop In Theory And Practice
Apache Hadoop In Theory And PracticeApache Hadoop In Theory And Practice
Apache Hadoop In Theory And Practice
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceHBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeFeb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
 
A Multi Colored YARN
A Multi Colored YARNA Multi Colored YARN
A Multi Colored YARN
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Why your Spark Job is Failing
Why your Spark Job is FailingWhy your Spark Job is Failing
Why your Spark Job is Failing
 
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceHBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 

Similar to Web Services Hadoop Summit 2012

Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
Wes Floyd
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE Implementation
FIWARE
 
Implementing Hadoop on a single cluster
Implementing Hadoop on a single clusterImplementing Hadoop on a single cluster
Implementing Hadoop on a single cluster
Salil Navgire
 

Similar to Web Services Hadoop Summit 2012 (20)

Future of HCatalog
Future of HCatalogFuture of HCatalog
Future of HCatalog
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 
HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
מיכאל
מיכאלמיכאל
מיכאל
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARE
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE Implementation
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
 
Windows Azure HDInsight Service
Windows Azure HDInsight ServiceWindows Azure HDInsight Service
Windows Azure HDInsight Service
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin Leau
 
Hive 3 a new horizon
Hive 3  a new horizonHive 3  a new horizon
Hive 3 a new horizon
 
Implementing Hadoop on a single cluster
Implementing Hadoop on a single clusterImplementing Hadoop on a single cluster
Implementing Hadoop on a single cluster
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Hadoop_arunam_ppt
Hadoop_arunam_pptHadoop_arunam_ppt
Hadoop_arunam_ppt
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 

More from Hortonworks

More from Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

Recently uploaded

Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
FIDO Alliance
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight

Web Services Hadoop Summit 2012

WebHDFS features (2)
• A Complete HDFS Interface
  – Supports all user operations
    – reading files
    – writing to files
    – mkdir, chmod, chown, mv, rm, …

• High Performance
  – Uses HTTP redirection to provide data locality
  – File reads and writes are redirected to the corresponding datanodes


     Architecting the Future of Big Data     Page 8
WebHDFS features (3)
• Secure Authentication
  – Same as Hadoop authentication: Kerberos (SPNEGO) and Hadoop delegation tokens
  – Supports proxy users

• An HDFS Built-in Component
  – WebHDFS is a first-class built-in component of HDFS
  – Runs inside namenodes and datanodes

• Apache Open Source
  – Available in Apache Hadoop 1.0 and above


     Architecting the Future of Big Data     Page 9
WebHDFS URI & URL
• FileSystem scheme: webhdfs://

• FileSystem URI: webhdfs://<HOST>:<HTTP_PORT>/<PATH>

• HTTP URL: http://<HOST>:<HTTP_PORT>/webhdfs/v1/<PATH>?op=..
  – Path prefix: /webhdfs/v1
  – Query: ?op=..


     Architecting the Future of Big Data     Page 10
URI/URL Examples
• Suppose we have the following file
    hdfs://namenode:8020/user/szetszwo/w.txt

• WebHDFS FileSystem URI
    webhdfs://namenode:50070/user/szetszwo/w.txt

• WebHDFS HTTP URL
    http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=..

• WebHDFS HTTP URL to open the file
    http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN


     Architecting the Future of Big Data     Page 11
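The mapping from a webhdfs:// FileSystem URI to the REST HTTP URL above is mechanical: same host and port, the /webhdfs/v1 path prefix, and the operation in the ?op= query. A minimal sketch in Python (the helper name is ours, not part of Hadoop):

```python
from urllib.parse import urlsplit, quote

def webhdfs_http_url(fs_uri, op):
    """Map a webhdfs:// FileSystem URI to the corresponding WebHDFS HTTP URL."""
    parts = urlsplit(fs_uri)
    # Same <HOST>:<HTTP_PORT>, with the /webhdfs/v1 prefix prepended to the
    # HDFS path and the operation passed as the ?op= query parameter.
    return "http://%s/webhdfs/v1%s?op=%s" % (parts.netloc, quote(parts.path), op)

print(webhdfs_http_url("webhdfs://namenode:50070/user/szetszwo/w.txt", "OPEN"))
# -> http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN
```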
Example: curl
• Use curl to open a file

    $ curl -i -L "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN"

    HTTP/1.1 307 TEMPORARY_REDIRECT
    Content-Type: application/octet-stream
    Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
    Content-Length: 0
    Server: Jetty(6.1.26)


     Architecting the Future of Big Data     Page 12
Example: curl (2)

    HTTP/1.1 200 OK
    Content-Type: application/octet-stream
    Content-Length: 21
    Server: Jetty(6.1.26)

    Hello, WebHDFS user!


     Architecting the Future of Big Data     Page 13
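The two-step exchange above is how WebHDFS achieves data locality: the namenode answers with a 307 redirect naming the datanode that holds the block, and the client (here, curl -L) repeats the request there. A non-redirect-following client would parse the Location header itself; a small sketch of that step (the function is ours, the sample response is the one shown above):

```python
def redirect_location(raw_response):
    """Pull the datanode URL out of a raw HTTP 307 response from the namenode."""
    for line in raw_response.splitlines():
        if line.lower().startswith("location:"):
            return line.split(":", 1)[1].strip()
    return None

# The 307 response from the namenode, as shown on the previous slide:
response = """HTTP/1.1 307 TEMPORARY_REDIRECT
Content-Type: application/octet-stream
Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
Content-Length: 0"""

# The client would then GET this datanode URL to read the actual file bytes.
print(redirect_location(response))
```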
Example: wget
• Use wget to open the same file

    $ wget "http://namenode:50070/webhdfs/v1/user/szetszwo/w.txt?op=OPEN" -O w.txt

    Resolving ... Connecting to ... connected.
    HTTP request sent, awaiting response... 307 TEMPORARY_REDIRECT
    Location: http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0 [following]


     Architecting the Future of Big Data     Page 14
Example: wget (2)

    --2012-06-13 01:42:10-- http://192.168.5.2:50075/webhdfs/v1/user/szetszwo/w.txt?op=OPEN&offset=0
    Connecting to 192.168.5.2:50075... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 21 [application/octet-stream]
    Saving to: `w.txt'

    100%[=================>] 21 --.-K/s in 0s

    2012-06-13 01:42:10 (3.34 MB/s) - `w.txt' saved [21/21]


     Architecting the Future of Big Data     Page 15
Example: Firefox
(screenshot: opening the same WebHDFS URL in a browser)


     Architecting the Future of Big Data     Page 16
HCatalog REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
• Uses JSON to describe metadata objects
• Versioned, because we assume we will have to update it:
    http://hadoop.acme.com/templeton/v1/…
• Runs in a Jetty server
• Supports security
  – Authentication done via Kerberos using SPNEGO
• Included in HDP, runs on the Thrift metastore server machine
• Not yet checked in, but you can find the code on Apache's JIRA HCATALOG-182


      © 2012 Hortonworks     Page 17
HCatalog REST API
Get a list of all tables in the default database:

    GET http://…/v1/ddl/database/default/table

    { "tables": ["counted", "processed"],
      "database": "default" }

Indicate the user with a URL parameter:
    http://…/v1/ddl/database/default/table?user.name=gates
Actions are authorized as the indicated user.


      © Hortonworks 2012     Page 18
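From a client's point of view this is just URL construction plus JSON parsing. A minimal sketch (the helper name is ours; the base URL and response body follow the slide above):

```python
import json
from urllib.parse import urlencode

def list_tables_url(base, database, user=None):
    """Build the 'list tables' URL, identifying the caller via user.name."""
    url = "%s/v1/ddl/database/%s/table" % (base, database)
    if user:
        url += "?" + urlencode({"user.name": user})
    return url

print(list_tables_url("http://hadoop.acme.com/templeton", "default", user="gates"))

# Response shaped like the one above (the slide's trailing comma inside the
# JSON array is presumably a transcription artifact and is dropped here):
reply = json.loads('{"tables": ["counted", "processed"], "database": "default"}')
print(reply["tables"])
```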
HCatalog REST API
Create a new table "rawevents":

    PUT http://…/v1/ddl/database/default/table/rawevents

    {"columns": [{ "name": "url", "type": "string" },
                 { "name": "user", "type": "string" }],
     "partitionedBy": [{ "name": "ds", "type": "string" }]}

    { "table": "rawevents", "database": "default" }


      © Hortonworks 2012     Page 19
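A client builds that request by serializing the column and partition descriptions to JSON and issuing a PUT. A sketch using only the standard library (the request is constructed but not sent; the host name is the slide's example):

```python
import json
from urllib.request import Request

# The PUT body for creating "rawevents", as shown on the slide above.
payload = {
    "columns": [
        {"name": "url", "type": "string"},
        {"name": "user", "type": "string"},
    ],
    "partitionedBy": [{"name": "ds", "type": "string"}],
}

req = Request(
    "http://hadoop.acme.com/templeton/v1/ddl/database/default/table/rawevents",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
# urllib.request.urlopen(req) would actually issue the request.
print(req.get_method(), req.full_url)
```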
HCatalog REST API
Describe table "rawevents":

    GET http://…/v1/ddl/database/default/table/rawevents

    { "columns": [{"name": "url", "type": "string"},
                  {"name": "user", "type": "string"}],
      "database": "default",
      "table": "rawevents" }


      © Hortonworks 2012     Page 20
Job Management
• Includes APIs to submit and monitor jobs
• Any files needed for the job are first uploaded to HDFS via WebHDFS
  – Pig and Hive scripts
  – Jars, Python scripts, or Ruby scripts for UDFs
  – Pig macros
• Results from the job are stored to HDFS and can be retrieved via WebHDFS
• The user is responsible for cleaning up output in HDFS
• Job state information is stored in ZooKeeper or HDFS


      © 2012 Hortonworks     Page 21
Job Submission
• Can submit MapReduce, Pig, and Hive jobs
• POST parameters include
  – the script to run, or an HDFS file containing the script/jar to run
  – the username to execute the job as
  – optionally, an HDFS directory to write results to (defaults to the user's home directory)
  – optionally, a URL to invoke GET on when the job is done

    POST http://hadoop.acme.com/templeton/v1/pig

    {"id": "job_201111111311_0012", …}


      © 2012 Hortonworks     Page 22
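The submission body is an ordinary HTML-form encoding of those parameters. A sketch of building it for the Pig endpoint above; note the field names ("execute", "statusdir", "callback") are assumptions on our part, since the slide lists the parameters only by purpose, not by name:

```python
from urllib.parse import urlencode

# Hypothetical field names -- the slide describes the parameters' purposes
# but does not name them.
fields = {
    "user.name": "gates",                        # execute the job as this user
    "execute": "A = load 'rawevents'; dump A;",  # the script to run (or a
                                                 # file parameter naming an
                                                 # HDFS path instead)
    "statusdir": "/user/gates/pigout",           # optional: results directory
    "callback": "http://client.acme.com/done",   # optional: GET-ed when done
}

# This string would be the body of POST http://hadoop.acme.com/templeton/v1/pig
body = urlencode(fields)
print(body)
```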
Find all Your Jobs
• GET on queue returns all jobs belonging to the submitting user
• Pig, Hive, and MapReduce jobs will be returned

    GET http://…/templeton/v1/queue?user.name=gates

    {"job_201111111311_0008", "job_201111111311_0012"}


      © 2012 Hortonworks     Page 23
Get Status of a Job
• Doing a GET on the jobid gets you information about a particular job
• Can be used to poll to see if the job is finished
• Used after the job is finished to get job information
• Doing a DELETE on the jobid kills the job

    GET http://…/templeton/v1/queue/job_201111111311_0012

    {…,
     "percentComplete": "100% complete",
     "exitValue": 0, …
     "completed": "done" }


      © 2012 Hortonworks     Page 24
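A polling client only needs to parse that JSON reply and check the completion marker. A minimal sketch (the function is ours; the reply shape follows the slide above):

```python
import json

def is_finished(status_json):
    """Decide from a queue/<jobid> reply whether the job has completed."""
    status = json.loads(status_json)
    # Per the example reply, "completed" is set to "done" once the job ends;
    # a real client would also inspect "exitValue" for success or failure.
    return status.get("completed") == "done"

reply = '{"percentComplete": "100% complete", "exitValue": 0, "completed": "done"}'
print(is_finished(reply))
```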
Future
• Job management
  – The job management APIs don't belong in HCatalog
  – They are only there by historical accident
  – Need to move them out to the MapReduce framework
• Authentication needs more options than Kerberos
• Integration with Oozie
• Need a directory service
  – Users should not need to connect to different servers for HDFS, HBase, HCatalog, Oozie, and job submission


      © 2012 Hortonworks     Page 25