This document provides instructions on how to install and configure Apache Drill to connect to various data sources like Oracle, Hive, and HBase. It describes how to use Drill's storage plugins to query data from these sources and also combine data from multiple sources using Drill queries. Examples of queries on each data source and combining data sources are also provided.
Apache Drill with Oracle, Hive and HBase
Page 1
APACHE DRILL WITH ORACLE, HIVE AND HBASE
Prepared By: Nag Arvind Gudiseva
PROBLEM STATEMENT
Create a data pipeline by analysing data from multiple data sources and persist a JSON document.
ARCHITECTURAL SOLUTION
Use Apache Drill storage plugins to connect to RDBMS sources (MySQL, Oracle, etc.), NoSQL and Hadoop data stores (MongoDB, Hive, HBase, etc.) and text documents (JSON, CSV, etc.). Analyse the data in the tables (with the schema discovered on the fly for text documents) and leverage the Apache Drill API to combine data from different tables (or text documents) across data sources on the fly. Apache Drill also exposes a REST web service, which can be consumed using a Java Jersey REST client program: call the POST method, submit Drill queries as the request object, and receive the response in JSON format, which can then be persisted on the Local File System.
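A query can be submitted to the Drill REST web service with a plain HTTP POST; the sketch below uses curl in place of the Java Jersey client and assumes a Drillbit running locally on the default web port 8047:

curl -X POST -H "Content-Type: application/json" \
     -d '{"queryType": "SQL", "query": "SELECT * FROM cp.`employee.json` LIMIT 5"}' \
     http://localhost:8047/query.json > result.json

The redirected output (result.json) is the JSON response that the pipeline would persist on the local file system.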
PICTORIAL ILLUSTRATION
Page 2
INSTALLATION STEPS ON UBUNTU 14.04 VM
1. Download Apache Drill using the wget command
wget http://mirror.symnds.com/software/Apache/drill/drill-1.4.0/apache-drill-1.4.0.tar.gz
2. Untar and extract
tar -xvzf apache-drill-1.4.0.tar.gz
3. Move the folder to a preferred location
sudo mv apache-drill-1.4.0 /usr/local/apache-drill
4. Install ZooKeeper:
a. Download the stable version (zookeeper-3.4.6.tar.gz) from http://hadoop.apache.org/zookeeper/releases.html
b. Untar and move the folder to a preferred location
c. Rename zoo_sample.cfg (in the conf directory) to zoo.cfg (see the shell sketch below)
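For example (a sketch mirroring the Drill steps above; the download mirror and the /usr/local/zookeeper target directory are assumptions):

wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
tar -xvzf zookeeper-3.4.6.tar.gz
sudo mv zookeeper-3.4.6 /usr/local/zookeeper
mv /usr/local/zookeeper/conf/zoo_sample.cfg /usr/local/zookeeper/conf/zoo.cfg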
STARTING DRILL
a. EMBEDDED MODE (with SqlLine)
<DRILL_HOME>/bin/sqlline -u jdbc:drill:zk=local
(OR)
./bin/drill-embedded
b. DISTRIBUTED MODE (Start ZooKeeper and Drill Bit)
<ZOOKEEPER_HOME>/bin/zkServer.sh start
<ZOOKEEPER_HOME>/bin/zkServer.sh status
(AND)
<DRILL_HOME>/bin/drillbit.sh start
<DRILL_HOME>/bin/drillbit.sh status
STOPPING DRILL (AND ZOOKEEPER)
a. EMBEDDED MODE (with SqlLine)
0: jdbc:drill:zk=local> !quit
Page 3
b. DISTRIBUTED MODE (Stop Drill Bit and ZooKeeper)
<DRILL_HOME>/bin/drillbit.sh stop
<DRILL_HOME>/bin/drillbit.sh status
(AND)
<ZOOKEEPER_HOME>/bin/zkServer.sh stop
<ZOOKEEPER_HOME>/bin/zkServer.sh status
JAR DEFAULT QUERIES
REFERENCE: <DRILL_HOME>/jars/3rdparty/foodmart-data-json-0.4.jar
0: jdbc:drill:zk=local> show databases;
0: jdbc:drill:zk=local> select employee_id, first_name, last_name, position_id, salary FROM cp.`employee.json` where salary > 30000;
0: jdbc:drill:zk=local> select employee_id, first_name, last_name, position_id, salary FROM cp.`employee.json` where salary > 30000 and position_id = 2;
0: jdbc:drill:zk=local> select emp.employee_id, emp.first_name, emp.salary, emp.department_id FROM cp.`employee.json` emp where emp.salary < 40000 and emp.salary > 21000;
0: jdbc:drill:zk=local> select emp.employee_id, emp.first_name, emp.salary, emp.department_id, dept.department_description FROM cp.`employee.json` emp, cp.`department.json` dept where emp.salary < 40000 and emp.salary > 21000 and emp.department_id = dept.department_id;
JSON SAMPLE QUERIES
SELECT * from dfs.`/home/gudiseva/arvind/zips.json` LIMIT 10;
CSV SAMPLE QUERIES
select * FROM dfs.`/home/gudiseva/arvind/sample.csv`;
select columns[0] as id, columns[1] as name, columns[2] as weight, columns[3] as height FROM dfs.`/home/gudiseva/arvind/sample.csv`;
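These queries assume that sample.csv holds four positional columns matching the aliases above (id, name, weight, height); an illustrative file, with hypothetical values and an id column that lines up with employee_id for the view shown next, could look like:

1,John Doe,70,175
2,Jane Roe,62,165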
CREATING VIEW BY QUERYING MULTIPLE DATA SOURCES
CREATE or REPLACE view dfs.tmp.MULTI_VIEW as
select emp.employee_id, phy.columns[1] as Name, dept.department_description, phy.columns[2] as Weight, phy.columns[3] as Height
FROM cp.`employee.json` emp, cp.`department.json` dept, dfs.`/home/gudiseva/arvind/sample.csv` phy
where CAST(emp.employee_id AS INT) = CAST(phy.columns[0] AS INT) and emp.department_id = dept.department_id;
SELECT * FROM dfs.tmp.MULTI_VIEW;
Page 5
"fs.default.name": "file:///",
"hive.metastore.sasl.enabled": "false"
}
}
select * from hive.arvind.`employee`;
NOTE:
HIVE SERVER should be started
$ hive --service hiveserver --verbose
[hive shell will not work when Hive Server is started]
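For reference, a complete Hive storage plugin definition in Drill (only the tail of which appears above) generally follows the shape below. This is a sketch for an embedded metastore; the Derby connection URL and warehouse directory are assumptions and must match the local Hive installation:

{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "",
    "javax.jdo.option.ConnectionURL": "jdbc:derby:;databaseName=/tmp/drill_hive_db;create=true",
    "hive.metastore.warehouse.dir": "/tmp/drill_hive_wh",
    "fs.default.name": "file:///",
    "hive.metastore.sasl.enabled": "false"
  }
}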
MONGODB
{
"type": "mongo",
"connection": "mongodb://first_name:last_name@ds048537.mongolab.com:48537/m101",
"enabled": true
}
select `_id`, `value` from mongo.m101.`storm`;
HBASE
{
"type": "hbase",
"config": {
"hbase.zookeeper.quorum": "localhost",
"hbase.zookeeper.property.clientPort": "2181"
},
"size.calculator.enabled":false,
"enabled": true
}
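The queries below assume an HBase table named emp with two column families, personal_data and professional_data, as used throughout this document; such a table could be created and populated from the HBase shell roughly as follows, with purely illustrative row keys and values:

create 'emp', 'personal_data', 'professional_data'
put 'emp', '1', 'personal_data:name', 'Alice'
put 'emp', '1', 'personal_data:city', 'Hyderabad'
put 'emp', '1', 'professional_data:designation', 'manager'
put 'emp', '1', 'professional_data:salary', '50000'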
SELECT CONVERT_FROM(row_key, 'UTF8') AS empid,
CONVERT_FROM(emp.personal_data.city, 'UTF8') AS city,
CONVERT_FROM(emp.personal_data.name, 'UTF8') AS name,
CONVERT_FROM(emp.professional_data.designation, 'UTF8') AS designation,
CONVERT_FROM(emp.professional_data.salary,'UTF8') AS salary
FROM hbase.`emp`;
Page 6
RELOAD .BASHRC:
source ~/.bashrc
(OR)
. ~/.bashrc
HBASE SAMPLE QUERIES
select * from hbase.`emp`;
SELECT CONVERT_FROM(row_key, 'UTF8') AS empid FROM hbase.`emp`;
SELECT CONVERT_FROM(row_key, 'UTF8') AS empid, CONVERT_FROM(emp.personal_data.city, 'UTF8') AS city FROM hbase.`emp`;
SELECT CONVERT_FROM(emp.personal_data.city, 'UTF8') AS city, CONVERT_FROM(emp.personal_data.name, 'UTF8') AS name FROM hbase.`emp`;
SELECT CONVERT_FROM(row_key, 'UTF8') AS empid, CONVERT_FROM(emp.personal_data.city, 'UTF8') AS city, CONVERT_FROM(emp.personal_data.name, 'UTF8') AS name, CONVERT_FROM(emp.professional_data.designation, 'UTF8') AS designation, CONVERT_FROM(emp.professional_data.salary, 'UTF8') AS salary FROM hbase.`emp`;
ORACLE, HIVE AND HBASE (UNION ALL) QUERIES
select id, name, salary from mysql.userdb.`employee` union all select id, first, salary from oracle.MY_APPL.`emp`;
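The mysql and oracle schemas referenced in these queries come from Drill's JDBC storage plugin, whose definition is not shown in these pages. A sketch of an Oracle plugin follows; the driver class, connection URL and credentials are assumptions, and the corresponding JDBC driver jar needs to be available to Drill (typically under <DRILL_HOME>/jars/3rdparty):

{
  "type": "jdbc",
  "driver": "oracle.jdbc.OracleDriver",
  "url": "jdbc:oracle:thin:@localhost:1521:xe",
  "username": "my_user",
  "password": "my_password",
  "enabled": true
}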
SELECT EID AS ID, NAME AS NAME, SALARY AS SALARY FROM hive.arvind.`employee` WHERE DESTINATION LIKE '%manager%'
UNION ALL
SELECT CONVERT_FROM(row_key, 'UTF8') AS ID, CONVERT_FROM(emp.personal_data.name, 'UTF8') AS NAME, CONVERT_FROM(emp.professional_data.salary, 'UTF8') AS SALARY FROM hbase.`emp` WHERE CONVERT_FROM(emp.professional_data.designation, 'UTF8') LIKE '%manager%';
SELECT EID AS ID, NAME AS NAME, TO_NUMBER(SALARY, '######') AS SALARY FROM hive.arvind.`employee` WHERE DESTINATION LIKE '%manager%'
UNION ALL
SELECT ID AS ID, FIRST AS NAME, SALARY AS SALARY FROM oracle.MY_APPL.`emp`
UNION ALL
SELECT CONVERT_FROM(row_key, 'UTF8') AS ID, CONVERT_FROM(emp.personal_data.name, 'UTF8') AS NAME, TO_NUMBER(emp.professional_data.salary, '######') AS SALARY FROM hbase.`emp` WHERE CONVERT_FROM(emp.professional_data.designation, 'UTF8') LIKE '%manager%';