AUG 05, 2016
Qubole - Big Data in Cloud
Kiryl Sultanau
2CONFIDENTIAL
BIG DATA CHALLENGES
3CONFIDENTIAL
BIG DATA BELONGS TO THE CLOUD
4CONFIDENTIAL
BIG DATA BELONGS TO THE CLOUD
5CONFIDENTIAL
BIG DATA BELONGS TO THE CLOUD
6CONFIDENTIAL
QUBOLE HISTORY
7CONFIDENTIAL
WHAT IS QUBOLE?
2014 Usage Statistics for Qubole on AWS:
• Total QCUH processed in 2014 = 40.6 million
• Total nodes managed in 2014 = 2.5 million
• Total PB processed in 2014 = 519
Personas: Operations Analyst • Marketing Ops Analyst • Data Architects • Business Users • Product Support • Customer Support • Developer • Sales Ops • Product Managers
Platform layers: Developer Tools • Service Management • Data Workbench • Cloud Data Platform • BI & DW Systems
Capabilities: SDK • API • Analysis • Security • Job Scheduler • Data Governance • Analytics templates • Monitoring • Support • Collaboration • Workflow & Map/Reduce • Auto Scaling • Cloud Optimization • Data Connectors • YARN • Presto & Hive • Spark & Pig
Hadoop Ecosystem (Apache Open Source)
8CONFIDENTIAL
QDS Cluster Types
9CONFIDENTIAL
CURRENTLY SUPPORTED QDS COMPONENTS
QDS Component Currently Supported Versions
Cascading Compatible with all versions
Hadoop 1 0.20.1
Hadoop 2 2.6.0
HBase 1.0
Hive 0.13.1 and 1.2
MapReduce 0.20.1 and 2.6.0
Pig 0.11 and 0.15
Presto 0.142
Spark 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.6.0, 1.6.1
Sqoop 0.20.1 ??? (No)
Tez 0.7
Zeppelin (notebooks) 1.0 ??? (0.6 or 0.5.6)
10CONFIDENTIAL
QDS INSTANCE SELECTION
11CONFIDENTIAL
CENTRALIZED HIVE METASTORE
12CONFIDENTIAL
INTERCLUSTER METASTORE
13CONFIDENTIAL
QUBOLE COMMUNICATION AND SECURITY
14CONFIDENTIAL
RIGHT TOOL FOR RIGHT WORKLOAD
• Large scale ETL
• Interactive Discovery Queries
• Machine Learning / Real-time queries
• High Performance DW Queries / Reporting backend
15CONFIDENTIAL
TIPS & TRICKS
1. S3 is the default storage
2. HDFS is temporary storage
3. Cluster start time ≈ 90 sec
4. Be ready for eventual-consistency issues
5. Multiple clusters are optimized for different workloads
6. Use spot instances
7. Use the Metastore API
8. Cluster restart is useful
9. Use Quark if possible
16CONFIDENTIAL
3rd Party Services
StreamX
RubiX
Quark
Connectors & Handlers
Airflow
17CONFIDENTIAL
STREAMX: KAFKA CONNECT FOR S3
StreamX is a Kafka Connect-based connector that copies data from Kafka to object stores such as Amazon S3,
Google Cloud Storage, and Azure Blob Store. It focuses on reliable and scalable data copying. One design
goal is to write the data in an analytics-friendly format (like Parquet), so that it can readily be used by analytical tools.
Features:
• Support for writing data in Avro and Parquet formats.
• Hive integration: the connector creates a partitioned Hive table and periodically adds partitions as new
partitions are written on S3.
• Pluggable partitioners:
• default partitioner: each Kafka partition has its data copied under a partition-specific directory
• time-based partitioner: ability to write data on an hourly basis
• field-based partitioner: ability to use a field in the record as a custom partitioner
• Exactly-once guarantee using a WAL
• Direct output to S3 (avoids writing to a temporary file and renaming it)
• Support for storing Hive tables in Qubole's Hive metastore (coming soon)
18CONFIDENTIAL
STREAMX: KAFKA CONNECT FOR S3
Configuration (core-site.xml):
• fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
• fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
• fs.s3a.access.key=xxxxx
• fs.s3a.secret.key=xxxxx
Sample
Run connect-distributed in Kafka : bin/connect-distributed.sh config/connect-distributed.xml
{"name":"twitter connector",
"config":{ "name":"twitter connector",
"connector.class":"com.qubole.streamx.s3.S3SinkConnector",
"tasks.max":"1", "flush.size":"100000", "s3.url":"<S3 location>",
"wal.class":"com.qubole.streamx.s3.wal.DBWAL",
"db.connection.url":"<jdbc:mysql://localhost:3306/kafka>",
"db.user":"<username_required>", "db.password":"<password_required>",
"hadoop.conf.dir":"<directory where hadoop conf files are stored. Example
/usr/lib/hadoop2/etc/hadoop/>",
"topics":"twitter1p"}}
19CONFIDENTIAL
RUBIX: LIGHT-WEIGHT DATA CACHING FRAMEWORK
RubiX is a light-weight data caching framework that can be used by Big Data engines. RubiX can be
extended, via plugins, to support any engine that accesses data in cloud stores through the Hadoop
FileSystem interface. Using the same plugins, RubiX can also be extended to work with any cloud store.
Supported Engines and Cloud Stores:
• Presto: Amazon S3 is supported
• Hadoop-1: Amazon S3 is supported
How to use it:
RubiX has two components: a BookKeeper server and a FileSystem implementation for the engine to use.
Start the BookKeeper server. It can be started via the hadoop jar command, e.g.:
hadoop jar rubix-bookkeeper-1.0.jar com.qubole.rubix.bookkeeper.BookKeeperServer
Engine side changes:
To use RubiX, place the appropriate jars in the classpath and configure the engine to use the RubiX
filesystem to access the cloud store (see the engine-specific plugin documentation for details).
20CONFIDENTIAL
QUARK: COST-BASED SQL OPTIMIZER
Quark optimizes access to data by managing relationships between tables across all databases in an
organization. Quark defines materialized views and OLAP cubes, using them to route queries between
tables stored in different databases. Quark is distributed as a JDBC jar and will work with most tools that
integrate through JDBC.
Create & manage optimized copies of base tables:
• Narrow tables with important attributes only.
• Sorted tables to speed up filters, joins and aggregation
• Denormalized tables wherein tables in a snowflake schema have been joined.
OLAP Cubes: Quark supports OLAP cubes on partial data (last 3 months of sales reports for example). It
also supports incremental refresh.
Bring your own database: Quark enables you to choose the technology stack. For example, optimized
copies or OLAP cubes can be stored in Redshift or a relational database, while base tables can be in S3 or HDFS and
accessed through Hive.
Rewrite common bad queries: A common example is omitting the partition columns, which leads to a
full table scan. Quark can infer the predicates on partition columns if there are related columns, or
enforce a policy to limit the data scanned.
21CONFIDENTIAL
QUARK: COST-BASED SQL OPTIMIZER
Administration: Database administrators are expected to register datasources, define views and cubes.
Quark can pull metadata from a multitude of data sources through an extensible Plugin interface. Once
the data sources are registered, the tables are referred to as data_source.schema_name.table_name.
Quark adds an extra namespace to avoid conflicts. DBAs can define or alter materialized views with DDL
statements such as:
Internals: Quark's capabilities are similar to a database optimizer's. Internally it uses Apache Calcite, which
is a cost-based optimizer, and Avatica, a sub-project of Apache Calcite, to implement the JDBC client
and server.
Quark parses, optimizes and routes the query to the most optimal dataset. For example, if the last months
of data in hive.default.page_views are in a data warehouse, Quark will execute queries in the data
warehouse instead of the table in Apache Hive.
create view page_views_partition as
select * from hive.default.page_views
where timestamp between "Mar 1 2016"
and "Mar 7 2016"
and group in ("en", "fr", "de")
stored in
data_warehouse.public.pview_partition
22CONFIDENTIAL
QDS: CONNECTORS & HANDLERS
Hive Storage Handler for Kinesis helps users read from and write to Kinesis Streams using Hive, enabling
them to run queries to analyze data that resides in Kinesis.
Hive Storage Handler for JDBC is a fork of HiveJdbcStorageHandler that helps users read from and write to
JDBC databases using Hive, enabling them to run SQL queries to analyze data that resides in JDBC
tables. Optimizations such as FilterPushDown have also been added.
CREATE TABLE TransactionLogs (
transactionId INT,
username STRING,
amount INT )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED BY 'com.qubole.hive.kinesis.HiveKinesisStorageHandler'
TBLPROPERTIES ('kinesis.stream.name'='TransactionStream');
CREATE EXTERNAL TABLE HiveTable(id INT, names STRING)
STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
TBLPROPERTIES ( "mapred.jdbc.driver.class"="com.mysql.jdbc.Driver",
"mapred.jdbc.url"="jdbc:mysql://localhost:3306/rstore",
"mapred.jdbc.username"="root", "mapred.jdbc.input.table.name"="JDBCTable",
"mapred.jdbc.output.table.name"="JDBCTable", "mapred.jdbc.password"="",
"mapred.jdbc.hive.lazy.split"= "false");
23CONFIDENTIAL
QDS: CONNECTORS & HANDLERS
Kinesis Connector for Presto allows the use of Kinesis streams as tables in Presto, such that each data
blob (message) in a Kinesis stream is presented as a row in Presto. A flexible table mapping approach lets
us treat fields of the messages as columns in the table.
{ "tableName": "test_table",
"schemaName": "otherworld",
"streamName": "test_kinesis_stream", "message":
{
"dataFormat": "json",
"fields": [
{ "name": "client_id",
"type": "BIGINT",
"mapping": "body/profile/clientId",
"comment": "The client ID field" },
{ "name": "routing_time",
"mapping": "header/routing_time",
"type": "DATE",
"dataFormat": "iso8601" } ]
} }
24CONFIDENTIAL
AIRFLOW AUTHOR, SCHEDULE AND MONITOR WORKFLOWS
Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes
your tasks on an array of workers while following the specified dependencies. Rich command line utilities
make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize
pipelines running in production, monitor progress, and troubleshoot issues when needed.
Principles:
• Dynamic - Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline
generation. This allows for writing code that instantiates pipelines dynamically.
• Extensible - Easily define your own operators, executors and extend the library so that it fits the level
of abstraction that suits your environment.
• Elegant - Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of
Airflow using the powerful Jinja templating engine.
• Scalable - Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary
number of workers. Airflow is ready to scale to infinity.
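Not part of the deck, but a minimal illustrative DAG makes the "pipelines as Python code" idea concrete (the DAG id, schedule, and task commands below are made up; the operator API matches Airflow of that era):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# One DAG that runs once a day.
dag = DAG(
    dag_id="daily_etl_example",
    default_args=default_args,
    start_date=datetime(2016, 8, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# load runs only after extract succeeds.
extract.set_downstream(load)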
25CONFIDENTIAL
QDS APIs & SDKs
Qubole REST API
Qubole Python SDK
Qubole Java SDK
26CONFIDENTIAL
QDS REST APIs
The Qubole Data Service (QDS) is accessible via REST APIs.
Access URL: https://api.qubole.com/api/${V}/ where ${V} is the API version (v1.2 … latest).
Authentication: a Qubole API token; set the token in the AUTH_TOKEN environment variable. (A minimal Python example follows the API type list below.)
API Types:
•Account API
•Cluster API
•Command Template API
•DbTap API
•Groups API
•Hive Metadata API
•Reports API
•Roles API
•Scheduler API
•Users API
•Command API
• Hive commands
• Hadoop jobs
• Pig commands
• Presto commands
• Spark commands
• DbImport commands
• DbExport commands
• Shell commands
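A hedged sketch of calling the REST API from Python with the requests library; only the base URL, the X-AUTH-TOKEN header, and the /commands endpoint shown on these slides are assumed (the "id" field of the response is inferred from the command examples later in the deck):

import os
import requests

BASE = "https://api.qubole.com/api/v1.2"
HEADERS = {
    "X-AUTH-TOKEN": os.environ["AUTH_TOKEN"],  # API token from the environment
    "Content-Type": "application/json",
    "Accept": "application/json",
}

# Submit a Hive command (same payload as the curl examples on the Command API slides).
resp = requests.post(
    BASE + "/commands/",
    headers=HEADERS,
    json={"query": "show tables;", "command_type": "HiveCommand"},
)
resp.raise_for_status()
print(resp.json()["id"])  # command id, used later to poll status and fetch results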
27CONFIDENTIAL
QDS HIVE METADATA API
QDS Hive Metadata API Types:
•Schema or Database
•Get Table Definition
•Store Table Properties
•Get Table Properties
•Delete Table Properties
Get Table Properties:
curl -i -X GET -H "Accept: application/json" -H "Content-type: application/json" -H "X-AUTH-TOKEN:
$AUTH_TOKEN" "https://api.qubole.com/api/v1.2/hive/default/daily_tick/table_properties"
{
"location": "s3n://paid-qubole/data/stock_tk",
"owner": "ec2-user",
"create-time": 1362388416,
"table-type": "EXTERNAL_TABLE",
"field.delim": ",",
"serialization.format": ","
}
Store Table Properties:
cat pl.json
{ "interval": "1",
"interval_unit": "days",
"columns": {
"stock_exchange": "",
"stock_symbol": "",
"year": "%Y",
"date": "%Y-%m-%d" }
}
curl -i -X POST -H "Accept: application/json" -H "Content-type: application/json" -H "X-AUTH-TOKEN:
$AUTH_TOKEN" --data @pl.json https://api.qubole.com/api/v1.2/hive/default/daily_tick/properties
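A hedged Python equivalent of the two curl calls above (same endpoints and pl.json payload; the requests library is assumed):

import os
import requests

BASE = "https://api.qubole.com/api/v1.2/hive/default/daily_tick"
HEADERS = {
    "X-AUTH-TOKEN": os.environ["AUTH_TOKEN"],
    "Content-Type": "application/json",
    "Accept": "application/json",
}

# Get table properties
print(requests.get(BASE + "/table_properties", headers=HEADERS).json())

# Store table properties (payload mirrors pl.json above)
payload = {
    "interval": "1",
    "interval_unit": "days",
    "columns": {"stock_exchange": "", "stock_symbol": "", "year": "%Y", "date": "%Y-%m-%d"},
}
print(requests.post(BASE + "/properties", headers=HEADERS, json=payload).status_code)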
28CONFIDENTIAL
QDS HIVE COMMAND API
This API is used to submit a Hive query.
Parameter Description
query Specify the Hive query to run. Either query or script_location is required.
script_location
Specify an S3 path where the Hive query to run is stored. Either query or script_location is required. AWS storage
credentials stored in the account are used to retrieve the script file.
command_type Hive command
label Cluster label to specify the cluster on which to run this command
retry Denotes the number of retries for a job. Valid values of retry are 1, 2, and 3.
macros Expressions to evaluate macros used in the Hive command. Refer to Macros in Scheduler for more details.
sample_size Size of the sample, in bytes, on which to run the query in test mode.
approx_mode_progress Value of progress for a constrained run. Valid value is a float between 0 and 1.
approx_mode_max_rt Constrained run max runtime in seconds
approx_mode_min_rt Constrained run min runtime in seconds
approx_aggregations Convert count distinct to count approx. Valid values are bool or NULL.
name
Add a name to the command that is useful while filtering commands from the command history. It does not
accept the special characters & (ampersand), < (less than), > (greater than), " (double quote), and ' (single quote),
or HTML tags. It can contain a maximum of 255 characters.
tag
Add a tag to a command so that it is easily identifiable and searchable from Commands History. Add a tag as a
filter value while searching commands. Max 255 characters. Use a comma as the separator for multiple tags.
29CONFIDENTIAL
QDS HIVE COMMAND API
Count the number of rows in the table:
export QUERY="select count(*) as num_rows from miniwikistats;"
curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H
"Accept: application/json" -d "{ \"query\":\"$QUERY\", \"command_type\": \"HiveCommand\" }"
"https://api.qubole.com/api/${V}/commands/"
Response:
HTTP/1.1 200 OK
{ "command": {
"approx_mode": false,
"approx_aggregations": false,
"query": "select count(*) as num_rows from miniwikistats;",
"sample": false },
"qbol_session_id": 0000,
…
"progress": 0,
"meta_data": {
"results_resource": "commands/3852/results",
"logs_resource": "commands/3852/logs" } }
30CONFIDENTIAL
QDS HIVE COMMAND API
Run a query stored in an S3 file location:
cat payload.json
{ "script_location":"<S3 Path>",
"command_type": "HiveCommand" }
curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H
"Accept: application/json" -d @payload.json "https://api.qubole.com/api/${V}/commands/"
Run a parameterized query stored in an S3 file location:
cat payload.json
{ "script_location":"<S3 Path>",
"macros":[
{"date":"moment('2011-01-11T00:00:00+00:00')"},
{"formatted_date":"date.clone().format('YYYY-MM-DD')"}],
"command_type": "HiveCommand" }
curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H
"Accept: application/json" -d @payload.json "https://api.qubole.com/api/${V}/commands/"
Submitting a Hive Query to a Specific Cluster:
curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H
"Accept: application/json" -d '{"query":"show tables;", "label":"HadoopCluster",
"command_type": "HiveCommand"}' "https://api.qubole.com/api/${V}/commands"
31CONFIDENTIAL
QDS SPARK COMMAND API
This API is used to submit a Spark command.
Parameter Description
program Provide the complete Spark Program in Scala, SQL, Command, R, or Python.
language Specify the language of the program, Scala, SQL, Command or Python. Required only when a program is used.
arguments Specify the spark-submit command line arguments here.
user_program_
arguments
Specify the arguments that the user program takes in.
cmdline
Alternatively, you can provide the spark-submit command line itself. If you use this option, you cannot use any other
parameters mentioned here. All required information is captured in command line itself.
command_type Spark command
label Cluster label to specify the cluster to run this command
app_id
ID of an app, which is the main abstraction of the Spark Job Server API. An app is used to store the configuration for a
Spark application. See Understanding the Spark Job Server for more information.
name
Add a name to the command that is useful while filtering commands from the command history. It does not accept &
(ampersand), < (less than), > (greater than), " (double quote), and ' (single quote). Max 255 characters.
tag
Add a tag to a command so that it is easily identifiable and searchable from Commands History. Add a tag as a filter
value while searching commands. Max 255 characters. Use a comma as the separator for multiple tags.
32CONFIDENTIAL
QDS SPARK COMMAND API
Example Python API Framework:
Example to Submit Spark Scala Program:
import sys, pycurl, json
c= pycurl.Curl()
url="https://api.qubole.com/api/v1.2/commands"
auth_token = <provide auth token here>
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HTTPHEADER, ["X-AUTH-TOKEN: "+ auth_token, "Content-Type:application/json", "Accept: application/json"])
c.setopt(pycurl.POST,1)
prog = '''
import scala.math.random
import org.apache.spark._
object SparkPi {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Spark Pi")
val spark = new SparkContext(conf)
val slices = if (args.length > 0) args(0).toInt else 2
val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
val count = spark.parallelize(1 until n, slices).map { i =>
val x = random * 2 - 1
val y = random * 2 - 1
if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
spark.stop()
}
}
'''
data=json.dumps({"program":prog, "language":"scala", "arguments":"--class SparkPi", "command_type":"SparkCommand"})
c.setopt(pycurl.POSTFIELDS, data)
c.perform()
33CONFIDENTIAL
QDS SPARK COMMAND API
Example to Submit Spark Command in SQL:
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H
"Accept: application/json" -d '{ "sql":"select * from default_qubole_memetracker
limit 10;", "language":"sql", "command_type":"SparkCommand", "label":"spark" }'
"https://api.qubole.com/api/${V}/commands"
Example to Submit a Spark Command in SQL to a Spark Job Server App:
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H
"Accept: application/json" -d '{ "sql":"select * from default_qubole_memetracker
limit 10;", "language":"sql", "command_type":"SparkCommand",
"label":"spark", "app_id":"3" }' "https://api.qubole.com/api/${V}/commands"
Where app_id is the Spark Job Server app ID. See Understanding the Spark Job Server for more
information.
34CONFIDENTIAL
QDS SCHEDULE API
This API creates a new schedule to run commands automatically at a certain frequency.
Parameter Description
command_type A valid command type supported by Qubole. For example, HiveCommand, HadoopCommand, PigCommand.
command
JSON object describing the command. Refer to the Command API for more details.
Sub fields can use macros. Refer to the Qubole Scheduler for more details.
start_time Start datetime for the schedule
end_time End datetime for the schedule
frequency
Specify how often the job should run. Input is an integer. For example, frequency of one hour/day/month is
represented as {"frequency":"1"}
time_unit
Denotes the time unit for the frequency. Its default value is days. Accepted values are minutes, hours, days, weeks, or
months.
… …
concurrency Specify how many job actions can run at a time. Default value is 1.
dependency_info
Describe dependencies for this job.
Check the Hive Datasets as Schedule Dependency for more information.
Notification
Parameters
It is an optional parameter that is set to false by default. You can set it to true if you want to be notified through
email about instance failure. Notification Parameters provides more information.
35CONFIDENTIAL
QDS SCHEDULE API
The query shown below aggregates the data for every stock symbol, every day:
curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H
"Content-type: application/json" -d
'{ "command_type":"HiveCommand",
"command": {
"query": "select stock_symbol, max(high), min(low), sum(volume)
from daily_tick_data
where date1='$formatted_date$'
group by stock_symbol" },
"macros": [ { "formatted_date": "Qubole_nominal_time.format('YYYY-MM-DD')" } ],
notification:{"is_digest": false, "notification_email_list":["user@qubole.com"],
"notify_failure": true, "notify_success": false}`
"start_time": "2012-07-01T02:00Z",
"end_time": "2022-07-01T02:00Z",
"frequency": "1",
"time_unit": "days", "time_out":"10",
"dependency_info": {} }'
"https://api.qubole.com/api/v1.2/scheduler"
36CONFIDENTIAL
QUBOLE DATA SERVICE JAVA SDK
A Java library that provides the tools you need to authenticate with and use the Qubole API.
Installation:
<dependency>
<groupId>com.qubole.qds-sdk-java</groupId>
<artifactId>qds-sdk-java</artifactId>
<version>0.7.0</version>
</dependency>
Usage
Allocate a QdsClient object:
QdsConfiguration configuration = new DefaultQdsConfiguration(YOUR_API_KEY);
QdsClient client = QdsClientFactory.newClient(configuration);
Then, make API calls as needed…
37CONFIDENTIAL
QUBOLE DATA SERVICE JAVA SDK
API call:
Future<CommandResponse> hiveCommandResponseFuture = client
.command().hive().query("show tables;").invoke();
CommandResponse hiveCommandResponse = hiveCommandResponseFuture.get();
API call (with Jersey's callback mechanism):
InvocationCallback<CommandResponse> callback = new InvocationCallback<CommandResponse>()
{
@Override
public void completed(CommandResponse clusterItems)
{
// ...
}
@Override
public void failed(Throwable throwable)
{
// ...
}
};
client.command().hive().query("show tables;").withCallback(callback).invoke();
...
38CONFIDENTIAL
QUBOLE DATA SERVICE JAVA SDK
Waiting for Results (Blocking):
ResultLatch latch = new ResultLatch(client, queryId);
ResultValue resultValue = latch.awaitResult();
Waiting for Results (with callback):
ResultLatch.Callback callback = new ResultLatch.Callback()
{
@Override
public void result(String queryId, ResultValue resultValue)
{
// use results
}
@Override
public void error(String queryId, Exception e)
{
// handle error
}
};
ResultLatch latch = new ResultLatch(client, queryId);
latch.callback(callback);
Paging:
// return page 2 using 3 per page
client.command().history().forPage(2, 3).invoke();
39CONFIDENTIAL
QUBOLE DATA SERVICE PYTHON SDK
A Python module that provides the tools you need to authenticate with and use the Qubole API.
Installation:
$ pip install qds-sdk
or
$ python setup.py install
CLI Usage (qds.py allows running Hive, Hadoop, Pig, Presto, and Shell commands against QDS. Users can run
commands synchronously, or submit a command and check its status):
$ qds.py --token 'xxyyzz' hivecmd run --query "show tables"
$ qds.py --token 'xxyyzz' hivecmd run --script_location /tmp/myquery
$ qds.py --token 'xxyyzz' hivecmd run --script_location s3://my-qubole-location/myquery
Pass in the API token from a bash environment variable:
$ export QDS_API_TOKEN=xxyyzz
$ qds.py hadoopcmd run streaming -files 's3n://paid-
qubole/HadoopAPIExamples/WordCountPython/mapper.py,s3n://paid-
qubole/HadoopAPIExamples/WordCountPython/reducer.py' -mapper mapper.py -reducer reducer.py -
numReduceTasks 1 -input 's3n://paid-qubole/default-datasets/gutenberg' -output
's3n://example.bucket.com/wcout'
$ qds.py hivecmd check 12345678
{"status": "done", ... }
40CONFIDENTIAL
QUBOLE DATA SERVICE PYTHON SDK
Programmatic Usage (a Python application needs to do the following):
1) Set the api_token:
2) Use the Command classes defined in commands.py to execute commands. To run a Hive command:
from qds_sdk.qubole import Qubole
Qubole.configure(api_token='ksbdvcwdkjn123423')
from qds_sdk.commands import *
hc=HiveCommand.create(query='show tables')
print "Id: %s, Status: %s" % (str(hc.id), hc.status)
41CONFIDENTIAL
QDS QUICK TOUR
42CONFIDENTIAL
QDS: MAIN NAVIGATION
43CONFIDENTIAL
QDS: PRODUCT OFFERINGS
44CONFIDENTIAL
QDS: EXPLORE AND IMPORT DATA
45CONFIDENTIAL
QDS: ANALYZE DATA
46CONFIDENTIAL
QDS: CLUSTER SETTINGS
47CONFIDENTIAL
QDS: Dashboard

More Related Content

What's hot

Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Holden Ackerman
 
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveHarnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveQubole
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringAnant Corporation
 
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Michael Rys
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Lace Lofranco
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...✔ Eric David Benari, PMP
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannDatabricks
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleVasu S
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Attunity Solutions for Teradata
Attunity Solutions for TeradataAttunity Solutions for Teradata
Attunity Solutions for TeradataAttunity
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure DatabricksJames Serra
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azureDavid Giard
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2inovex GmbH
 
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu GantaAzure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu GantaDatabricks
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkMatt Ingenthron
 

What's hot (20)

Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
 
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache HiveHarnessing the Hadoop Ecosystem Optimizations in Apache Hive
Harnessing the Hadoop Ecosystem Optimizations in Apache Hive
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
 
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...
 
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...
 
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
Database Camp 2016 @ United Nations, NYC - Michael Glukhovsky, Co-Founder, Re...
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
 
Digital Transformation with Microsoft Azure
Digital Transformation with Microsoft AzureDigital Transformation with Microsoft Azure
Digital Transformation with Microsoft Azure
 
An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Attunity Solutions for Teradata
Attunity Solutions for TeradataAttunity Solutions for Teradata
Attunity Solutions for Teradata
 
Interactive query using hadoop
Interactive query using hadoopInteractive query using hadoop
Interactive query using hadoop
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
Azure Data Factory v2
Azure Data Factory v2Azure Data Factory v2
Azure Data Factory v2
 
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu GantaAzure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 

Viewers also liked

Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup   Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup Qubole
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Joydeep Sen Sarma
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...yalisassoon
 
Getting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataGetting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataQubole
 
5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoption5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoptionQubole
 
Nw qubole overview_033015
Nw qubole overview_033015Nw qubole overview_033015
Nw qubole overview_033015Michael Mersch
 
Unlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSUnlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSAmazon Web Services
 
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature MappingMicrosoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature MappingIlyas F ☁☁☁
 
Informatica Big Data Edition - Profinit - Jan Ulrych
Informatica Big Data Edition - Profinit - Jan UlrychInformatica Big Data Edition - Profinit - Jan Ulrych
Informatica Big Data Edition - Profinit - Jan UlrychProfinit
 
Running Spark on Cloud
Running Spark on CloudRunning Spark on Cloud
Running Spark on CloudQubole
 
Big Data Day LA 2015 - Lessons learned from scaling Big Data in the Cloud by...
Big Data Day LA 2015 -  Lessons learned from scaling Big Data in the Cloud by...Big Data Day LA 2015 -  Lessons learned from scaling Big Data in the Cloud by...
Big Data Day LA 2015 - Lessons learned from scaling Big Data in the Cloud by...Data Con LA
 
Database Camp 2016 @ United Nations, NYC - Javier de la Torre, CEO, CARTO
Database Camp 2016 @ United Nations, NYC - Javier de la Torre, CEO, CARTODatabase Camp 2016 @ United Nations, NYC - Javier de la Torre, CEO, CARTO
Database Camp 2016 @ United Nations, NYC - Javier de la Torre, CEO, CARTO✔ Eric David Benari, PMP
 
Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7mmathipra
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages JaunesDataiku
 
Database Camp 2016 @ United Nations, NYC - Amir Orad, CEO, Sisense
Database Camp 2016 @ United Nations, NYC - Amir Orad, CEO, SisenseDatabase Camp 2016 @ United Nations, NYC - Amir Orad, CEO, Sisense
Database Camp 2016 @ United Nations, NYC - Amir Orad, CEO, Sisense✔ Eric David Benari, PMP
 
Keen IO Presents at Under the Radar 2013
Keen IO Presents at Under the Radar 2013Keen IO Presents at Under the Radar 2013
Keen IO Presents at Under the Radar 2013Dealmaker Media
 
Petit Club "Le Commerce On/Off" - Présentation d'Alkemics
Petit Club "Le Commerce On/Off" - Présentation d'AlkemicsPetit Club "Le Commerce On/Off" - Présentation d'Alkemics
Petit Club "Le Commerce On/Off" - Présentation d'AlkemicsPetit Web
 
Dataiku r users group v2
Dataiku   r users group v2Dataiku   r users group v2
Dataiku r users group v2Cdiscount
 

Viewers also liked (20)

Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup   Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...
 
Getting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataGetting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big Data
 
5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoption5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoption
 
Nw qubole overview_033015
Nw qubole overview_033015Nw qubole overview_033015
Nw qubole overview_033015
 
Unlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSUnlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWS
 
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature MappingMicrosoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
Microsoft Azure vs Amazon Web Services (AWS) Services & Feature Mapping
 
Big dataanalyticsinthecloud
Big dataanalyticsinthecloudBig dataanalyticsinthecloud
Big dataanalyticsinthecloud
 
Cloud Optimized Big Data
Cloud Optimized Big DataCloud Optimized Big Data
Cloud Optimized Big Data
 
Informatica Big Data Edition - Profinit - Jan Ulrych
Informatica Big Data Edition - Profinit - Jan UlrychInformatica Big Data Edition - Profinit - Jan Ulrych
Informatica Big Data Edition - Profinit - Jan Ulrych
 
Running Spark on Cloud
Running Spark on CloudRunning Spark on Cloud
Running Spark on Cloud
 
Big Data Day LA 2015 - Lessons learned from scaling Big Data in the Cloud by...
Big Data Day LA 2015 -  Lessons learned from scaling Big Data in the Cloud by...Big Data Day LA 2015 -  Lessons learned from scaling Big Data in the Cloud by...
Big Data Day LA 2015 - Lessons learned from scaling Big Data in the Cloud by...
 
Database Camp 2016 @ United Nations, NYC - Javier de la Torre, CEO, CARTO
Database Camp 2016 @ United Nations, NYC - Javier de la Torre, CEO, CARTODatabase Camp 2016 @ United Nations, NYC - Javier de la Torre, CEO, CARTO
Database Camp 2016 @ United Nations, NYC - Javier de la Torre, CEO, CARTO
 
Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7Meet the experts dwo bde vds v7
Meet the experts dwo bde vds v7
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
 
Database Camp 2016 @ United Nations, NYC - Amir Orad, CEO, Sisense
Database Camp 2016 @ United Nations, NYC - Amir Orad, CEO, SisenseDatabase Camp 2016 @ United Nations, NYC - Amir Orad, CEO, Sisense
Database Camp 2016 @ United Nations, NYC - Amir Orad, CEO, Sisense
 
Keen IO Presents at Under the Radar 2013
Keen IO Presents at Under the Radar 2013Keen IO Presents at Under the Radar 2013
Keen IO Presents at Under the Radar 2013
 
Petit Club "Le Commerce On/Off" - Présentation d'Alkemics
Petit Club "Le Commerce On/Off" - Présentation d'AlkemicsPetit Club "Le Commerce On/Off" - Présentation d'Alkemics
Petit Club "Le Commerce On/Off" - Présentation d'Alkemics
 
Dataiku r users group v2
Dataiku   r users group v2Dataiku   r users group v2
Dataiku r users group v2
 

Similar to Qubole - Big data in cloud

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
 
Big data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructureBig data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructuredatastack
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)James Serra
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoopnvvrajesh
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseJames Serra
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Fwdays
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Cask Data
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Michael Rys
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Bostonkbajda
 
Tech-Spark: SQL Server on Linux
Tech-Spark: SQL Server on LinuxTech-Spark: SQL Server on Linux
Tech-Spark: SQL Server on LinuxRalph Attard
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraVictor Coustenoble
 
Aucfanlab Datalake - Big Data Management Platform -
Aucfanlab Datalake - Big Data Management Platform -Aucfanlab Datalake - Big Data Management Platform -
Aucfanlab Datalake - Big Data Management Platform -Aucfan
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Trivadis
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonClaudiu Barbura
 

Similar to Qubole - Big data in cloud (20)

Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
Big data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructureBig data architecture on cloud computing infrastructure
Big data architecture on cloud computing infrastructure
 
Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)Azure Synapse Analytics Overview (r2)
Azure Synapse Analytics Overview (r2)
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Azure SQL
Azure SQLAzure SQL
Azure SQL
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
Best practices on Building a Big Data Analytics Solution (SQLBits 2018 Traini...
 
CC -Unit4.pptx
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptx
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Boston
 
Tech-Spark: SQL Server on Linux
Tech-Spark: SQL Server on LinuxTech-Spark: SQL Server on Linux
Tech-Spark: SQL Server on Linux
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Aucfanlab Datalake - Big Data Management Platform -
Aucfanlab Datalake - Big Data Management Platform -Aucfanlab Datalake - Big Data Management Platform -
Aucfanlab Datalake - Big Data Management Platform -
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron)
 
Sas hpa-va-bda-exadata-2389280
Sas hpa-va-bda-exadata-2389280Sas hpa-va-bda-exadata-2389280
Sas hpa-va-bda-exadata-2389280
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
 

More from Dmitry Tolpeko

Big Data Analytics for BI, BA and QA
Big Data Analytics for BI, BA and QABig Data Analytics for BI, BA and QA
Big Data Analytics for BI, BA and QADmitry Tolpeko
 
Epam BI - Near Realtime Marketing Support System
Epam BI - Near Realtime Marketing Support SystemEpam BI - Near Realtime Marketing Support System
Epam BI - Near Realtime Marketing Support SystemDmitry Tolpeko
 
Big Data Technology - Solit 2015 Conference
Big Data Technology - Solit 2015 ConferenceBig Data Technology - Solit 2015 Conference
Big Data Technology - Solit 2015 ConferenceDmitry Tolpeko
 
Apache Yarn - Hadoop Cluster Management
Apache Yarn -  Hadoop Cluster ManagementApache Yarn -  Hadoop Cluster Management
Apache Yarn - Hadoop Cluster ManagementDmitry Tolpeko
 
Bi 2.0 hadoop everywhere
Bi 2.0   hadoop everywhereBi 2.0   hadoop everywhere
Bi 2.0 hadoop everywhereDmitry Tolpeko
 
Apache Kafka - Messaging System Overview
Apache Kafka - Messaging System OverviewApache Kafka - Messaging System Overview
Apache Kafka - Messaging System OverviewDmitry Tolpeko
 

More from Dmitry Tolpeko (6)

Big Data Analytics for BI, BA and QA
Big Data Analytics for BI, BA and QABig Data Analytics for BI, BA and QA
Big Data Analytics for BI, BA and QA
 
Epam BI - Near Realtime Marketing Support System
Epam BI - Near Realtime Marketing Support SystemEpam BI - Near Realtime Marketing Support System
Epam BI - Near Realtime Marketing Support System
 
Big Data Technology - Solit 2015 Conference
Big Data Technology - Solit 2015 ConferenceBig Data Technology - Solit 2015 Conference
Big Data Technology - Solit 2015 Conference
 
Apache Yarn - Hadoop Cluster Management
Apache Yarn -  Hadoop Cluster ManagementApache Yarn -  Hadoop Cluster Management
Apache Yarn - Hadoop Cluster Management
 
Bi 2.0 hadoop everywhere
Bi 2.0   hadoop everywhereBi 2.0   hadoop everywhere
Bi 2.0 hadoop everywhere
 
Apache Kafka - Messaging System Overview
Apache Kafka - Messaging System OverviewApache Kafka - Messaging System Overview
Apache Kafka - Messaging System Overview
 

Recently uploaded

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Qubole - Big data in cloud

  • 18. STREAMX: KAFKA CONNECT FOR S3
    Configuration (core-site.xml):
    • fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
    • fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
    • fs.s3a.access.key=xxxxx
    • fs.s3a.secret.key=xxxxx
    Sample: run connect-distributed in Kafka:
    bin/connect-distributed.sh config/connect-distributed.xml
    Connector configuration (JSON):
    {"name": "twitter connector",
     "config": {"name": "twitter connector",
                "connector.class": "com.qubole.streamx.s3.S3SinkConnector",
                "tasks.max": "1",
                "flush.size": "100000",
                "s3.url": "<S3 location>",
                "wal.class": "com.qubole.streamx.s3.wal.DBWAL",
                "db.connection.url": "<jdbc:mysql://localhost:3306/kafka>",
                "db.user": "<username_required>",
                "db.password": "<password_required>",
                "hadoop.conf.dir": "<directory where hadoop conf files are stored, e.g. /usr/lib/hadoop2/etc/hadoop/>",
                "topics": "twitter1p"}}
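    Since the connector runs on Kafka Connect in distributed mode, the JSON payload above would typically be registered through the Connect REST interface. Below is a minimal Python sketch; the worker address (localhost:8083, the Connect default) and the file name streamx-connector.json are assumptions for illustration, not taken from the slides.

    import json
    import requests  # assumes the 'requests' package is installed

    # Load the StreamX connector definition shown above (hypothetical file name).
    with open("streamx-connector.json") as f:
        connector = json.load(f)

    # Register the connector with the Kafka Connect REST API
    # (default distributed-mode endpoint; adjust host/port for your worker).
    resp = requests.post(
        "http://localhost:8083/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector),
    )
    resp.raise_for_status()
    print(resp.json())  # echoes the created connector and its config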
  • 19. RUBIX: LIGHT-WEIGHT DATA CACHING FRAMEWORK
    RubiX is a light-weight data caching framework for big data engines. Through plugins, RubiX can be extended to support any engine that accesses data in cloud stores via the Hadoop FileSystem interface, and the same plugin mechanism lets it work with any cloud store.
    Supported engines and cloud stores:
    • Presto: Amazon S3 is supported
    • Hadoop-1: Amazon S3 is supported
    How to use it: RubiX has two components, a BookKeeper server and a FileSystem implementation for the engine to use.
    Start the BookKeeper server. It can be started via the hadoop jar command, e.g.:
    hadoop jar rubix-bookkeeper-1.0.jar com.qubole.rubix.bookkeeper.BookKeeperServer
    Engine-side changes: to use RubiX, place the appropriate jars on the classpath and configure the engine to use the RubiX filesystem to access the cloud store. The sections below show how to get started with the supported plugins.
  • 20. QUARK: COST-BASED SQL OPTIMIZER
    Quark optimizes access to data by managing relationships between tables across all databases in an organization. It defines materialized views and OLAP cubes and uses them to route queries between tables stored in different databases. Quark is distributed as a JDBC jar and works with most tools that integrate through JDBC.
    Create & manage optimized copies of base tables:
    • Narrow tables with important attributes only.
    • Sorted tables to speed up filters, joins and aggregations.
    • Denormalized tables wherein the tables of a snowflake schema have been joined.
    OLAP cubes: Quark supports OLAP cubes on partial data (for example, the last 3 months of sales reports) and also supports incremental refresh.
    Bring your own database: Quark lets you choose the technology stack. For example, optimized copies or OLAP cubes can be stored in Redshift or an RDBMS, while base tables can live in S3 or HDFS and be accessed through Hive.
    Rewrite common bad queries: a common example is omitting partition columns, which leads to a full table scan. Quark can infer predicates on partition columns from related columns, or enforce a policy to limit the data scanned.
  • 21. QUARK: COST-BASED SQL OPTIMIZER
    Administration: database administrators are expected to register data sources and define views and cubes. Quark can pull metadata from a multitude of data sources through an extensible plugin interface. Once data sources are registered, tables are referred to as data_source.schema_name.table_name; Quark adds the extra namespace to avoid conflicts. DBAs can define or alter materialized views with DDL statements such as:
    create view page_views_partition as
    select * from hive.default.page_views
    where timestamp between "Mar 1 2016" and "Mar 7 2016" and group in ("en", "fr", "de")
    stored in data_warehouse.public.pview_partition
    Internals: Quark's capabilities are similar to a database optimizer's. Internally it uses Apache Calcite, a cost-based optimizer, and Avatica, a sub-project of Apache Calcite, to implement the JDBC client and server. Quark parses, optimizes and routes each query to the most optimal dataset. For example, if the last months of data in hive.default.page_views are in a data warehouse, Quark will execute queries against the data warehouse instead of the table in Apache Hive.
  • 22. QDS: CONNECTORS & HANDLERS
    Hive Storage Handler for Kinesis helps users read from and write to Kinesis Streams using Hive, enabling them to run queries that analyze data residing in Kinesis:
    CREATE TABLE TransactionLogs (transactionId INT, username STRING, amount INT)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED BY 'com.qubole.hive.kinesis.HiveKinesisStorageHandler'
    TBLPROPERTIES ('kinesis.stream.name'='TransactionStream');
    Hive Storage Handler for JDBC, a fork of HiveJdbcStorageHandler, helps users read from and write to JDBC databases using Hive and run SQL queries that analyze data residing in JDBC tables. Optimizations such as FilterPushDown have also been added:
    CREATE EXTERNAL TABLE HiveTable(id INT, names STRING)
    STORED BY 'org.apache.hadoop.hive.jdbc.storagehandler.JdbcStorageHandler'
    TBLPROPERTIES (
      "mapred.jdbc.driver.class"="com.mysql.jdbc.Driver",
      "mapred.jdbc.url"="jdbc:mysql://localhost:3306/rstore",
      "mapred.jdbc.username"="root",
      "mapred.jdbc.input.table.name"="JDBCTable",
      "mapred.jdbc.output.table.name"="JDBCTable",
      "mapred.jdbc.password"="",
      "mapred.jdbc.hive.lazy.split"="false");
  • 23. QDS: CONNECTORS & HANDLERS
    Kinesis Connector for Presto allows Kinesis streams to be used as tables in Presto, such that each data blob (message) in a Kinesis stream is presented as a row in Presto. A flexible table-mapping approach lets fields of the messages be treated as columns of the table:
    {
      "tableName": "test_table",
      "schemaName": "otherworld",
      "streamName": "test_kinesis_stream",
      "message": {
        "dataFormat": "json",
        "fields": [
          {"name": "client_id", "type": "BIGINT", "mapping": "body/profile/clientId", "comment": "The client ID field"},
          {"name": "routing_time", "mapping": "header/routing_time", "type": "DATE", "dataFormat": "iso8601"}
        ]
      }
    }
  • 24. AIRFLOW: AUTHOR, SCHEDULE AND MONITOR WORKFLOWS
    Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks (a minimal DAG sketch follows below). The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap, and the rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
    Principles:
    • Dynamic - Airflow pipelines are configuration as code (Python), allowing for dynamic pipeline generation.
    • Extensible - easily define your own operators and executors, and extend the library so that it fits the level of abstraction that suits your environment.
    • Elegant - Airflow pipelines are lean and explicit. Parameterizing your scripts is built into the core of Airflow using the powerful Jinja templating engine.
    • Scalable - Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers. Airflow is ready to scale to infinity.
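    To make the "configuration as code" idea concrete, here is a minimal DAG sketch, assuming an Airflow 1.x installation; the DAG id, schedule and bash commands are illustrative placeholders, not taken from the deck.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # Hypothetical two-task pipeline: extract, then load.
    default_args = {
        "owner": "qubole_demo",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    dag = DAG(
        dag_id="daily_etl_example",
        default_args=default_args,
        start_date=datetime(2016, 8, 1),
        schedule_interval="@daily",
    )

    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pull raw data for {{ ds }}'",  # Jinja-templated run date
        dag=dag,
    )

    load = BashOperator(
        task_id="load",
        bash_command="echo 'load partition {{ ds }}'",
        dag=dag,
    )

    # The load task runs only after extract succeeds.
    extract.set_downstream(load)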
  • 25. QDS APIs & SDKs
    • Qubole REST API
    • Qubole Python SDK
    • Qubole Java SDK
  • 26. QDS REST APIs
    The Qubole Data Service (QDS) is accessible via REST APIs.
    Access URL: https://api.qubole.com/api/${V}/ where ${V} is the version of the API (v1.2 … latest).
    Authentication: the Qubole API token; set the API token in the AUTH_TOKEN environment variable. (A request sketch follows below.)
    API types:
    • Account API
    • Cluster API
    • Command Template API
    • DbTap API
    • Groups API
    • Hive Metadata API
    • Reports API
    • Roles API
    • Scheduler API
    • Users API
    • Command API: Hive commands, Hadoop jobs, Pig commands, Presto commands, Spark commands, DbImport commands, DbExport commands, Shell commands
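    As a sketch of what an authenticated request looks like outside of curl, the Python snippet below performs the same Get Table Properties call shown on the next slide. It assumes the requests package is installed and that AUTH_TOKEN is exported, as in the curl examples.

    import os
    import requests  # assumes the 'requests' package is installed

    API_TOKEN = os.environ["AUTH_TOKEN"]   # same variable the curl examples use
    BASE_URL = "https://api.qubole.com/api/v1.2"

    headers = {
        "X-AUTH-TOKEN": API_TOKEN,
        "Content-Type": "application/json",
        "Accept": "application/json",
    }

    # Same call as the "Get Table Properties" curl example on the next slide.
    resp = requests.get(BASE_URL + "/hive/default/daily_tick/table_properties",
                        headers=headers)
    resp.raise_for_status()
    print(resp.json())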
  • 27. QDS HIVE METADATA API
    QDS Hive Metadata API types:
    • Schema or Database
    • Get Table Definition
    • Store Table Properties
    • Get Table Properties
    • Delete Table Properties
    Get Table Properties:
    curl -i -X GET -H "Accept: application/json" -H "Content-type: application/json" -H "X-AUTH-TOKEN: $AUTH_TOKEN" "https://api.qubole.com/api/v1.2/hive/default/daily_tick/table_properties"
    Sample response:
    { "location": "s3n://paid-qubole/data/stock_tk", "owner": "ec2-user", "create-time": 1362388416, "table-type": "EXTERNAL_TABLE", "field.delim": ",", "serialization.format": "," }
    Store Table Properties:
    cat pl.json
    { "interval": "1", "interval_unit": "days", "columns": { "stock_exchange": "", "stock_symbol": "", "year": "%Y", "date": "%Y-%m-%d" } }
    curl -i -X POST -H "Accept: application/json" -H "Content-type: application/json" -H "X-AUTH-TOKEN: $AUTH_TOKEN" --data @pl.json https://api.qubole.com/api/v1.2/hive/default/daily_tick/properties
  • 28. QDS HIVE COMMAND API
    This API is used to submit a Hive query. Parameters:
    • query - the Hive query to run. Either query or script_location is required.
    • script_location - an S3 path where the Hive query to run is stored. Either query or script_location is required. AWS storage credentials stored in the account are used to retrieve the script file.
    • command_type - Hive command ("HiveCommand" in the examples below).
    • label - cluster label specifying the cluster on which to run this command.
    • retry - number of retries for a job. Valid values are 1, 2, and 3.
    • macros - expressions to evaluate macros used in the Hive command. Refer to Macros in Scheduler for more details.
    • sample_size - size of the sample in bytes on which to run the query in test mode.
    • approx_mode_progress - value of progress for a constrained run. Valid value is a float between 0 and 1.
    • approx_mode_max_rt - constrained run maximum runtime in seconds.
    • approx_mode_min_rt - constrained run minimum runtime in seconds.
    • approx_aggregations - convert count distinct to approximate count. Valid values are bool or NULL.
    • name - a name for the command, useful when filtering commands in the command history. It does not accept & (ampersand), < (less than), > (greater than), " (double quote), ' (single quote), or HTML tags, and can contain a maximum of 255 characters.
    • tag - a tag so that the command is easily identifiable and searchable in Commands History; a tag can be used as a filter value when searching commands. Max 255 characters; a comma separates multiple tags.
  • 29. QDS HIVE COMMAND API
    Count the number of rows in the table:
    export QUERY="select count(*) as num_rows from miniwikistats;"
    curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" -d '{ "query":"$QUERY", "command_type": "HiveCommand" }' "https://api.qubole.com/api/${V}/commands/"
    Response:
    HTTP/1.1 200 OK
    { "command": { "approx_mode": false, "approx_aggregations": false, "query": "select count(*) as num_rows from miniwikistats;", "sample": false }, "qbol_session_id": 0000, … "progress": 0, "meta_data": { "results_resource": "commands/3852/results", "logs_resource": "commands/3852/logs" } }
  • 30. QDS HIVE COMMAND API
    Run a query stored in an S3 file location:
    cat payload.json
    { "script_location":"<S3 Path>", "command_type": "HiveCommand" }
    curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" -d @payload "https://api.qubole.com/api/${V}/commands/"
    Run a parameterized query stored in an S3 file location:
    cat payload.json
    { "script_location":"<S3 Path>", "macros":[ {"date":"moment('2011-01-11T00:00:00+00:00')"}, {"formatted_date":"date.clone().format('YYYY-MM-DD')"}], "command_type": "HiveCommand" }
    curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" -d @payload "https://api.qubole.com/api/${V}/commands/"
    Submitting a Hive query to a specific cluster:
    curl -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" -d '{"query":"show tables;", "label":"HadoopCluster", "command_type": "HiveCommand"}' "https://api.qubole.com/api/${V}/commands"
  • 31. QDS SPARK COMMAND API
    This API is used to submit a Spark command. Parameters:
    • program - the complete Spark program in Scala, SQL, Command, R, or Python.
    • language - the language of the program: Scala, SQL, Command, or Python. Required only when a program is used.
    • arguments - the spark-submit command line arguments.
    • user_program_arguments - the arguments that the user program takes in.
    • cmdline - alternatively, you can provide the spark-submit command line itself. If you use this option, you cannot use any other parameters mentioned here; all required information is captured in the command line itself.
    • command_type - Spark command ("SparkCommand" in the examples below).
    • label - cluster label specifying the cluster on which to run this command.
    • app_id - ID of an app, which is the main abstraction of the Spark Job Server API. An app is used to store the configuration for a Spark application. See Understanding the Spark Job Server for more information.
    • name - a name for the command, useful when filtering commands in the command history. It does not accept & (ampersand), < (less than), > (greater than), " (double quote), or ' (single quote). Max 255 characters.
    • tag - a tag so that the command is easily identifiable and searchable in Commands History; a tag can be used as a filter value when searching commands. Max 255 characters; a comma separates multiple tags.
  • 32. QDS SPARK COMMAND API
    Example Python API framework, submitting a Spark Scala program:
    import sys, pycurl, json
    c = pycurl.Curl()
    url = "https://api.qubole.com/api/v1.2/commands"
    auth_token = <provide auth token here>
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.HTTPHEADER, ["X-AUTH-TOKEN: " + auth_token, "Content-Type:application/json", "Accept: application/json"])
    c.setopt(pycurl.POST, 1)
    prog = '''
    import scala.math.random
    import org.apache.spark._
    object SparkPi {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Spark Pi")
        val spark = new SparkContext(conf)
        val slices = if (args.length > 0) args(0).toInt else 2
        val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
        val count = spark.parallelize(1 until n, slices).map { i =>
          val x = random * 2 - 1
          val y = random * 2 - 1
          if (x*x + y*y < 1) 1 else 0
        }.reduce(_ + _)
        println("Pi is roughly " + 4.0 * count / n)
        spark.stop()
      }
    }
    '''
    data = json.dumps({"program": prog, "language": "scala", "arguments": "--class SparkPi", "command_type": "SparkCommand"})
    c.setopt(pycurl.POSTFIELDS, data)
    c.perform()
  • 33. QDS SPARK COMMAND API
    Example: submit a Spark command in SQL:
    curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" -d '{ "sql":"select * from default_qubole_memetracker limit 10;", "language":"sql","command_type":"SparkCommand", "label":"spark" }' "https://api.qubole.com/api/${V}/commands"
    Example: submit a Spark command in SQL to a Spark Job Server app, where app_id is the Spark Job Server app ID (see Understanding the Spark Job Server for more information):
    curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" -d '{ "sql":"select * from default_qubole_memetracker limit 10;", "language":"sql","command_type":"SparkCommand", "label":"spark","app_id":"3" }' "https://api.qubole.com/api/${V}/commands"
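    The cmdline alternative listed among the Spark command parameters above is not demonstrated in the deck's examples. Below is a hedged sketch of what such a submission could look like in Python; the spark-submit path, jar location and class name are placeholders, and the requests package is assumed to be installed.

    import json
    import os

    import requests  # assumes the 'requests' package is installed

    headers = {
        "X-AUTH-TOKEN": os.environ["AUTH_TOKEN"],
        "Content-Type": "application/json",
        "Accept": "application/json",
    }

    # The whole spark-submit invocation goes into "cmdline"; per the parameter
    # table above, no other Spark parameters may be combined with it.
    # Jar path and class below are illustrative placeholders.
    payload = {
        "cmdline": "/usr/lib/spark/bin/spark-submit --class SparkPi "
                   "s3://my-bucket/jars/spark-pi.jar 10",
        "command_type": "SparkCommand",
        "label": "spark",
    }

    resp = requests.post("https://api.qubole.com/api/v1.2/commands",
                         headers=headers, data=json.dumps(payload))
    resp.raise_for_status()
    print(resp.json())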
  • 34. QDS SCHEDULE API
    This API creates a new schedule to run commands automatically at a certain frequency. Parameters:
    • command_type - a valid command type supported by Qubole, for example HiveCommand, HadoopCommand, PigCommand.
    • command - JSON object describing the command. Refer to the Command API for more details. Sub-fields can use macros; refer to the Qubole Scheduler for more details.
    • start_time - start datetime for the schedule.
    • end_time - end datetime for the schedule.
    • frequency - how often the job should run; the input is an integer. For example, a frequency of one hour/day/month is represented as {"frequency":"1"}.
    • time_unit - the time unit for the frequency. The default value is days; accepted values are minutes, hours, days, weeks, or months.
    • …
    • concurrency - how many job actions can run at a time. Default value is 1.
    • dependency_info - describes dependencies for this job. Check Hive Datasets as Schedule Dependency for more information.
    • Notification parameters - optional, set to false by default. Set to true to be notified through email about instance failure. Notification Parameters provides more information.
  • 35. QDS SCHEDULE API
    The query shown below aggregates the data for every stock symbol, every day:
    curl -i -X POST -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Accept: application/json" -H "Content-type: application/json" -d '{
      "command_type": "HiveCommand",
      "command": { "query": "select stock_symbol, max(high), min(low), sum(volume) from daily_tick_data where date1='$formatted_date$' group by stock_symbol" },
      "macros": [ { "formatted_date": "Qubole_nominal_time.format('YYYY-MM-DD')" } ],
      "notification": {"is_digest": false, "notification_email_list": ["user@qubole.com"], "notify_failure": true, "notify_success": false},
      "start_time": "2012-07-01T02:00Z",
      "end_time": "2022-07-01T02:00Z",
      "frequency": "1",
      "time_unit": "days",
      "time_out": "10",
      "dependency_info": {}
    }' "https://api.qubole.com/api/v1.2/scheduler"
  • 36. QUBOLE DATA SERVICE JAVA SDK
    A Java library that provides the tools you need to authenticate with and use the Qubole API.
    Installation:
    <dependency>
      <groupId>com.qubole.qds-sdk-java</groupId>
      <artifactId>qds-sdk-java</artifactId>
      <version>0.7.0</version>
    </dependency>
    Usage: allocate a QdsClient object:
    QdsConfiguration configuration = new DefaultQdsConfiguration(YOUR_API_KEY);
    QdsClient client = QdsClientFactory.newClient(configuration);
    Then make API calls as needed…
  • 37. QUBOLE DATA SERVICE JAVA SDK
    API call:
    Future<CommandResponse> hiveCommandResponseFuture = client.command().hive().query("show tables;").invoke();
    CommandResponse hiveCommandResponse = hiveCommandResponseFuture.get();
    API call (with Jersey's callback mechanism):
    InvocationCallback<CommandResponse> callback = new InvocationCallback<CommandResponse>() {
      @Override
      public void completed(CommandResponse clusterItems) {
        // ...
      }
      @Override
      public void failed(Throwable throwable) {
        // ...
      }
    };
    client.command().hive().query("show tables;").withCallback(callback).invoke();
    ...
  • 38. QUBOLE DATA SERVICE JAVA SDK
    Waiting for results (blocking):
    ResultLatch latch = new ResultLatch(client, queryId);
    ResultValue resultValue = latch.awaitResult();
    Waiting for results (with callback):
    ResultLatch.Callback callback = new ResultLatch.Callback() {
      @Override
      public void result(String queryId, ResultValue resultValue) {
        // use results
      }
      @Override
      public void error(String queryId, Exception e) {
        // handle error
      }
    };
    ResultLatch latch = new ResultLatch(client, queryId);
    latch.callback(callback);
    Paging:
    // return page 2 using 3 per page
    client.command().history().forPage(2, 3).invoke();
  • 39. QUBOLE DATA SERVICE PYTHON SDK
    A Python module that provides the tools you need to authenticate with and use the Qubole API.
    Installation:
    $ pip install qds-sdk
    or
    $ python setup.py install
    CLI usage (qds.py allows running Hive, Hadoop, Pig, Presto and Shell commands against QDS; users can run commands synchronously, or submit a command and check its status):
    $ qds.py --token 'xxyyzz' hivecmd run --query "show tables"
    $ qds.py --token 'xxyyzz' hivecmd run --script_location /tmp/myquery
    $ qds.py --token 'xxyyzz' hivecmd run --script_location s3://my-qubole-location/myquery
    Pass in the API token from a bash environment variable:
    $ export QDS_API_TOKEN=xxyyzz
    $ qds.py hadoopcmd run streaming -files 's3n://paid-qubole/HadoopAPIExamples/WordCountPython/mapper.py,s3n://paid-qubole/HadoopAPIExamples/WordCountPython/reducer.py' -mapper mapper.py -reducer reducer.py -numReduceTasks 1 -input 's3n://paid-qubole/default-datasets/gutenberg' -output 's3n://example.bucket.com/wcout'
    $ qds.py hivecmd check 12345678
    {"status": "done", ... }
  • 40. QUBOLE DATA SERVICE PYTHON SDK
    Programmatic usage (a Python application needs to do the following):
    1) Set the api_token:
    from qds_sdk.qubole import Qubole
    Qubole.configure(api_token='ksbdvcwdkjn123423')
    2) Use the Command classes defined in commands.py to execute commands. To run a Hive command:
    from qds_sdk.commands import *
    hc = HiveCommand.create(query='show tables')
    print "Id: %s, Status: %s" % (str(hc.id), hc.status)
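    HiveCommand.create returns as soon as QDS accepts the command, so a program typically polls until the command finishes and then fetches results. A minimal sketch, assuming the qds-sdk helpers find, is_done, is_success and get_results behave as in the public qds-sdk-py source (worth verifying against the SDK version in use):

    import sys
    import time

    from qds_sdk.qubole import Qubole
    from qds_sdk.commands import HiveCommand

    Qubole.configure(api_token='ksbdvcwdkjn123423')  # same token placeholder as above

    # Submit the query; this call returns as soon as QDS accepts the command.
    hc = HiveCommand.create(query='show tables')

    # Poll until the command reaches a terminal state
    # (assumed SDK helpers: find / is_done / is_success / get_results).
    while not HiveCommand.is_done(hc.status):
        time.sleep(5)
        hc = HiveCommand.find(hc.id)

    if HiveCommand.is_success(hc.status):
        hc.get_results(sys.stdout)   # stream query results to stdout
    else:
        print("Command %s finished with status %s" % (hc.id, hc.status))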