The document discusses Hive, a data warehouse infrastructure built on Hadoop. It provides an overview of Hive architecture and components. Key concepts discussed include Hive query language (HQL), creating and managing databases and tables in Hive, loading and querying data, partitioning tables for performance, and bucketing data for distributed processing. The document appears to be from a presentation or lecture on Hive and big data technologies.
The literature contains a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data reuse. It can be overwhelming. Based on recent empirical work (analyzing data reuse proxies at scale, understanding data sensemaking, and looking at how researchers search for data), I discuss which practices are a good starting point for helping others reuse your data.
Content + Signals: The value of the entire data estate for machine learningPaul Groth
Content-centric organizations have increasingly recognized the value of their material for analytics and decision support systems based on machine learning. However, as anyone involved in machine learning projects will tell you, the difficulty is not in the provision of the content itself but in the production of the annotations necessary to make use of that content for ML. The transformation of content into training data often requires manual human annotation. This is expensive, particularly when the nature of the content requires subject matter experts to be involved.
In this talk, I highlight emerging approaches to tackling this challenge using what's known as weak supervision - using other signals to help annotate data. I discuss how content companies often overlook resources that they have in-house to provide these signals. I aim to show how looking at a data estate in terms of signals can amplify its value for artificial intelligence.
The Roots: Linked data and the foundations of successful Agriculture DataPaul Groth
Some thoughts on successful data for the agricultural domain. Keynote at Linked Open Data in Agriculture
MACS-G20 Workshop in Berlin, September 27th and 28th, 2017 https://www.ktbl.de/inhalte/themen/ueber-uns/projekte/macs-g20-loda/lod/
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...Carole Goble
Keynote given by Carole Goble on 23rd July 2013 at ISMB/ECCB 2013
http://www.iscb.org/ismbeccb2013
How could we evaluate research and researchers? Reproducibility underpins the scientific method: at least in principle, if not in practice. The willing exchange of results and the transparent conduct of research can only be expected up to a point in a competitive environment. Contributions to science are acknowledged, but not if the credit is for data curation or software. From a bioinformatics viewpoint, how far could our results be reproducible before the pain is just too high? Is open science a dangerous, utopian vision or a legitimate, feasible expectation? How do we move bioinformatics from one where results are post-hoc "made reproducible" to pre-hoc "born reproducible"? And why, in our computational information age, do we communicate results through fragmented, fixed documents rather than cohesive, versioned releases? I will explore these questions drawing on 20 years of experience in both the development of technical infrastructure for Life Science and the social infrastructure in which Life Science operates.
IDCC Workshop: Analysing DMPs to inform research data services: lessons from ...Amanda Whitmire
A workshop as part of the International Digital Curation Conference 2016 on DMP development and support. This presentation demonstrates how we can use data management plans as a source of information to better understand researcher data stewardship practices and how to support them. Be sure to see the slide notes to better understand the presentation (most slides are just photos/icons).
Our regular Introduction to Data Management (DM) workshop (90-minutes). Covers very basic DM topics and concepts. Audience is graduate students from all disciplines. Most of the content is in the NOTES FIELD.
February 18 2015 NISO Virtual Conference Scientific Data Management: Caring for Your Institution and its Intellectual Wealth
Using data management plans as a research tool: an introduction to the DART Project
Amanda L. Whitmire, Ph.D., Assistant Professor, Data Management Specialist, Oregon State University Libraries & Press
Experimental Result Analysis of Text Categorization using Clustering and Clas...ijtsrd
In a world that routinely produces ever more textual data, managing that data is a critical task. Many text analysis methods are available for managing and visualizing it, but many techniques give lower accuracy because of the ambiguity of natural language. To provide fine-grained analysis, this paper introduces efficient machine learning algorithms to categorize text data. To improve accuracy, the proposed system uses the Natural Language Toolkit (NLTK) Python library to perform natural language processing. The main aim of the proposed system is to generalize the model for real-time text categorization applications by using efficient text classification and clustering machine learning algorithms, and to find the most efficient and accurate model for the input dataset using performance measures. Patil Kiran Sanajy | Prof. Kurhade N. V. "Experimental Result Analysis of Text Categorization using Clustering and Classification Algorithms" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-4, June 2019, URL: https://www.ijtsrd.com/papers/ijtsrd25077.pdf
Paper URL: https://www.ijtsrd.com/engineering/computer-engineering/25077/experimental-result-analysis-of-text-categorization-using-clustering-and-classification-algorithms/patil-kiran-sanajy
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsPaul Groth
A look at how the thinking about Web Data and the sources of semantics can help drive decisions on combining latent and explicit knowledge. Examples from Elsevier and lots of pointers to related work.
Indexing based Genetic Programming Approach to Record Deduplicationidescitation
In this paper, we present a genetic programming (GP) approach to record deduplication with indexing techniques. Data deduplication is a process in which data are cleaned of duplicate records arising from misspellings, field swaps, or other mistakes and inconsistencies. This process requires identifying objects that are included in more than one list. The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in data warehouses, so we need an algorithm that can detect and eliminate as many duplicates as possible. GP with indexing is an optimization technique that helps find the maximum number of duplicates in a database. We used a deduplication function that is able to identify whether two or more entries in a repository are replicas. Many industries and systems depend on the accuracy and reliability of databases to carry out their operations, so the quality of the information stored in those databases can have significant cost implications for a system that relies on that information to function and conduct business. Moreover, clean and replica-free repositories not only allow the retrieval of higher-quality information but also lead to more concise data and potential savings in the computational time and resources needed to process it.
Keynote for Theory and Practice of Digital Libraries 2017
The theory and practice of digital libraries provides a long history of thought around how to manage knowledge, ranging from collection development to cataloging and resource description. These tools were all designed to make knowledge findable and accessible to people. Even technical progress in information retrieval and question answering is targeted at helping answer a human's information need.
However, increasingly the demand is for data: data needed not for people's consumption but to drive machines. As an example of this demand, there has been explosive growth in job openings for Data Engineers – professionals who prepare data for machine consumption. In this talk, I overview the information needs of machine intelligence and ask the question: are our knowledge management techniques applicable for serving this new consumer?
Research Data Sharing: A Basic FrameworkPaul Groth
Some thoughts on thinking about data sharing. Prepared for the 2016 LERU Doctoral Summer School - Data Stewardship for Scientific Discovery and Innovation.
http://www.dtls.nl/fair-data/fair-data-training/leru-summer-school/
Mining academic social networks is becoming increasingly necessary with the growing amount of data, and it is a favorite research topic for many researchers. Data mining techniques are used for mining academic social networks. In this paper, we present an efficient frequent-itemset mining technique for academic social networks. The proposed framework first processes the research documents, and then enhanced frequent-itemset mining is applied to find the strength of relationships between researchers. The proposed method is faster than older algorithms and requires less main memory for computation.
Prov-O-Viz is a visualisation service for provenance graphs expressed using the W3C PROV vocabulary. It uses the Sankey-style visualisation from D3js.
See http://provoviz.org
The need for a transparent data supply chainPaul Groth
Illustrating data supply chains and motivating the need for a more transparent data supply chain in the context of responsible data science. Presented at the 2018 KNAW-Royal Society bilateral meeting on responsible data science.
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)Stéphane Fréchette
How is Big Data moved around? How are you planning to move it?
This session will focus on familiar and not-so-familiar tools you can use today for moving and integrating Big Data, and will also outline the technologies and platform (an introduction to Big Data, Hadoop, HDInsight, and tools). We will compare and outline options, discuss how they can work with your existing Hadoop and Windows Azure environment, and provide some guidance on when and how to use each of these tools.
Hadoop World 2011: Replacing RDB/DW with Hadoop and Hive for Telco Big Data -...Cloudera, Inc.
This session will focus on the challenges of replacing existing Relational Database and Data Warehouse technologies with Open Source components. Jason Han will base his presentation on his experience migrating Korea Telecom (KT's) CDR data from Oracle to Hadoop, which required converting many Oracle SQL queries to Hive HQL queries. He will cover the differences between SQL and HQL; the implementation of Oracle's basic/analytics functions with MapReduce; the use of Sqoop for bulk loading RDB data into Hadoop; and the use of Apache Flume for collecting fast-streamed CDR data. He'll also discuss Lucene and ElasticSearch for near-realtime distributed indexing and searching. You'll learn tips for migrating existing enterprise big data to open source, and gain insight into whether this strategy is suitable for your own data.
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis. It was originally developed by Facebook.
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Sumeet Singh
In the last eight years, the Hadoop grid infrastructure has allowed us to move towards a unified source of truth for all data at Yahoo that now accounts for over 450 petabytes of raw HDFS and 1.1 billion data files. Managing data location, schema knowledge and evolution, fine-grained business rules based access control, and audit and compliance needs have become critical with the increasing scale of operations.
In this talk, we will share our approach in tackling the above challenges with Apache HCatalog, a table and storage management layer for Hadoop. We will explain how to register existing HDFS files into HCatalog, provide broader but controlled access to data through a data discovery tool, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data into HCatalog. In addition, the approach allows ever improving Hive performance to open up easy adhoc access to analyze and visualize data through SQL on Hadoop and popular BI tools.
As we discuss our approach, we will also highlight along how our approach minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage and integrity.
Apache Pig basics; commands to work with Pig; advantages and disadvantages of Pig; Pig wrapper classes; history of Pig; components of Pig; execution of Pig; Pig Latin basics; EVAL functions; embedded Pig; Pig versus MapReduce; Pig running environment; Hadoop environment; logical plan; physical plan; ordering; joins
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Unleashing the Power of Data_ Choosing a Trusted Analytics Platform.pdfEnterprise Wired
In this guide, we'll explore the key considerations and features to look for when choosing a Trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Learn SQL from basic queries to Advance queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
4. Why HIVE
• Structured data
• Large data sets
• MapReduce
• Parallel distribution
• Query data
Dr. V. Bhuvaneswari, Asst. Professor, Dept. of Comp. Appl., Bharathiar University - WDABT 2016
6. HIVE Architecture
• User Interface: Web UI, Hive command line, HD Insight
• Meta Store
• Hive QL Process Engine
• Execution Engine
• Storage system: HDFS or HBASE
8. Hive File Formats
• Text files - delimited by parameters
• Sequence files - less data
• RC files - analytic processing
• ORC files - optimized file format in binary
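The file format is selected when a table is created, via a STORED AS clause. A minimal sketch (table and column names are illustrative, not from the deck):

```sql
-- Assumed example: choosing a file format at table-creation time.
CREATE TABLE logs_text (line STRING)
STORED AS TEXTFILE;   -- plain delimited text

CREATE TABLE logs_orc (line STRING)
STORED AS ORC;        -- optimized, binary, columnar format
```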
9. HIVE Query Language (HQL)
Hive query language offers:
• Create database
• Create, manage and partition tables
• Various operators (relational, arithmetic and logical) to evaluate functions
• Support for DDL and DML
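To give a feel for these operators in HQL, here is a small illustrative query; the `books` table and its columns are hypothetical:

```sql
-- Hypothetical books table used only for illustration.
SELECT title,
       price * 0.9 AS discounted_price   -- arithmetic operator
FROM   books
WHERE  price > 100                       -- relational operator
  AND  category = 'Science';             -- logical operator
```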
10. DDL (Data Definition Language) Statements
The DDL commands are listed below:
• CREATE, ALTER, DROP database
• CREATE, ALTER, DROP, TRUNCATE table
• CREATE, ALTER with partitioning and bucketing
• CREATE VIEW
• SHOW
• DESCRIBE
11. DML (Data Manipulation Language) Statements
• Loading files
• Inserting data into Hive tables from queries
12. Database Operations
Syntax
CREATE DATABASE IF NOT EXISTS db_name
COMMENT 'db_name Details'
WITH DBPROPERTIES ('creator' = 'name');
Example
CREATE DATABASE IF NOT EXISTS LIBDETS
COMMENT 'LIBRARY DETAILS'
WITH DBPROPERTIES ('creator' = 'KIRUTHI');
13. Database Operations
Syntax
SHOW DATABASES; -- displays available databases
Example
SHOW DATABASES;
Syntax
DESCRIBE DATABASE db_name; -- displays the schema of a database
DESCRIBE DATABASE EXTENDED db_name;
Example
DESCRIBE DATABASE LIBDETS;
DESCRIBE DATABASE EXTENDED LIBDETS;
14. ALTER Database
Syntax
ALTER DATABASE db_name -- alter database properties
SET DBPROPERTIES ('edited-by' = 'name');
Example
ALTER DATABASE LIBDETS
SET DBPROPERTIES ('edited-by' = 'KANI');
15. USE, DROP Database
Syntax
USE db_name; -- set db_name as the current working database
Example
USE LIBDETS;
Syntax
DROP DATABASE db_name; -- delete the database
Example
DROP DATABASE LIBDETS;
16. TABLES
Hive supports two types of tables:
• Managed table – data is stored in the Hive warehouse folder and removed when the table is dropped
• External table – data remains at the specified location even when the table is dropped; only the table metadata is removed
17. Creating a Managed Table
Syntax
CREATE TABLE IF NOT EXISTS tb_name (column_name data_type, column_name data_type, column_name data_type)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Example
CREATE TABLE IF NOT EXISTS LIBTBL (Member_Code INT, Member_Name STRING, Designation STRING, Dept_code INT, dept_name STRING, group_name STRING, course_name STRING, title STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
18. External Table
Creating an External Table
Syntax
CREATE EXTERNAL TABLE IF NOT EXISTS tb_name (column_name data_type, column_name data_type, column_name data_type)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/home/usr/directory_path'; -- LOCATION should point to a directory holding the data files
Example
CREATE EXTERNAL TABLE IF NOT EXISTS LIBTBL (Member_Code INT, Member_Name STRING, Designation STRING, Dept_code INT, course_code INT, dept_name STRING, group_name STRING, course_name STRING, title STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/home/livrith/Desktop/Book2.csv';
19. Loading Data into a Table
Syntax
LOAD DATA LOCAL INPATH 'local_file_or_directory_path'
OVERWRITE INTO TABLE tb_name;
Example
LOAD DATA LOCAL INPATH '/home/kiruthika/Documents/Book2.csv'
OVERWRITE INTO TABLE LIBTBL;
20. SELECT Clause
Syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM tb_name
[WHERE where_condition]
[GROUP BY column_name]
[HAVING having_condition]
[ORDER BY column_name]
[DISTRIBUTE BY column_name]
[LIMIT number];
Example 1
SELECT * FROM LIBTBL;
Example 2
SELECT Member_Name, Designation FROM LIBTBL;
21. SELECT – WHERE
Example
SELECT * FROM LIBUDET
WHERE group_name = 'TEACHING'
OR group_name = 'student'
AND Dept_code >= 18;
SELECT – pattern matching with LIKE
Syntax
SELECT column1, column2, column3 FROM tb_name
WHERE column_name LIKE '%alp%';
Example
SELECT PRODUCT, STATE, CITY FROM SALESDETS
WHERE City LIKE '%O%';
22. GROUP BY
Example
SELECT PRODUCT, COUNT(PRODUCT) AS C1, STATE, COUNTRY
FROM SALESDETS
GROUP BY PRODUCT, STATE, COUNTRY;
ORDER BY -- total ordering of the output; uses only one reducer
Example
SELECT PRODUCT, STATE, PRICE, COUNTRY FROM SALESDETS
ORDER BY COUNTRY;
23. SORT BY -- sorts the data within each reducer
Example
SELECT PRODUCT, STATE, COUNTRY FROM SALESDETS
SORT BY COUNTRY
LIMIT 10;
HAVING -- filters groups produced by GROUP BY
Example
SELECT PRODUCT, COUNT(PRODUCT) AS C1, STATE, COUNTRY
FROM SALESDETS
GROUP BY PRODUCT, STATE, COUNTRY
HAVING C1 > 5;
24. LIMIT
Example
SELECT PRODUCT, STATE, PRICE, COUNTRY FROM SALESDETS
LIMIT 10;
DISTRIBUTE BY -- distributes rows among reducers
Syntax
SELECT column_name1, column_name2, column_name3 FROM tb_name
DISTRIBUTE BY column_name
SORT BY column_name ASC, column_name ASC
LIMIT count;
Example
SELECT PRODUCT, PRICE, STATE FROM SALESDETS
DISTRIBUTE BY STATE
SORT BY STATE ASC, PRODUCT ASC
LIMIT 50;
25. CLUSTER BY -- combines DISTRIBUTE BY and SORT BY on the same column
Example
SELECT PRODUCT, PRICE, STATE FROM SALESDETS
CLUSTER BY STATE LIMIT 50;
Difference in execution of ORDER BY, SORT BY, DISTRIBUTE BY, CLUSTER BY
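As an illustrative sketch of the difference, the four clauses can be contrasted on the same query (reusing the SALESDETS table from the examples above):

```sql
-- ORDER BY: total order across all output; a single reducer performs the final sort.
SELECT PRODUCT, STATE FROM SALESDETS ORDER BY STATE;

-- SORT BY: each reducer sorts its own share of rows; output is only locally ordered.
SELECT PRODUCT, STATE FROM SALESDETS SORT BY STATE;

-- DISTRIBUTE BY: rows with the same STATE go to the same reducer; no ordering implied.
SELECT PRODUCT, STATE FROM SALESDETS DISTRIBUTE BY STATE;

-- CLUSTER BY: shorthand for DISTRIBUTE BY STATE SORT BY STATE (ascending).
SELECT PRODUCT, STATE FROM SALESDETS CLUSTER BY STATE;
```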
26. Data Aggregation
• COUNT
• AVG, AVG(DISTINCT ...)
• MIN, MIN(DISTINCT ...)
• MAX, MAX(DISTINCT ...)
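A short example applying these aggregate functions to the SALESDETS table used in the earlier slides (column names follow those examples):

```sql
SELECT COUNT(*)              AS total_rows,
       COUNT(DISTINCT STATE) AS distinct_states,
       AVG(PRICE)            AS avg_price,
       MIN(PRICE)            AS min_price,
       MAX(PRICE)            AS max_price
FROM SALESDETS;
```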
27. Partitions
Hive reads the entire dataset from the warehouse even when a filter condition on a particular column is specified. This becomes a bottleneck in MapReduce jobs and involves a huge amount of I/O.
Partitioning breaks a large dataset into small chunks based on column values, so queries that filter on those columns read only the relevant partitions.
Hive supports two types of partitioning:
• Static partitioning
• Dynamic partitioning
28. Creating a Partitioned Table
Syntax
CREATE TABLE tb_name (column1 data_type, column2 data_type, column3 data_type)
COMMENT 'Details of the dataset'
PARTITIONED BY (column_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Example
CREATE TABLE MY_TABLE1 (Member_Name STRING, dept_name STRING, group_name STRING, course_name STRING, title STRING)
COMMENT 'User information'
PARTITIONED BY (Designation STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
29. Loading Data into a Static Partition
Syntax
LOAD DATA LOCAL INPATH 'file_path'
OVERWRITE INTO TABLE tb_name
PARTITION (column_name = 'value'); -- for a static partition, the target partition is named explicitly
Example
LOAD DATA LOCAL INPATH '/home/livrith/Desktop/mytab.csv'
OVERWRITE INTO TABLE MY_TABLE2;
30. Enabling Dynamic Partitioning
The following settings must be enabled to execute dynamic partition inserts:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
31. Inserting Data into a Dynamic Partition Table
Syntax
INSERT OVERWRITE TABLE target_tb_name PARTITION (column_name)
SELECT column_name1, column_name2, column_name3 FROM source_tb_name;
-- the partition column must be the last attribute in the SELECT list
Example
INSERT OVERWRITE TABLE MY_TABLE1 PARTITION (Designation)
SELECT Member_Name, dept_name, group_name, course_name, title, Designation
FROM MY_TABLE2;
33. Bucketing
• Bucketing is similar to partitioning.
• Each bucket is stored as a file.
• Bucketing hashes the values of a specified column to divide data into a fixed number of files, whereas partitioning divides data into separate directories based on distinct column values.
34. Table Creation
Syntax
CREATE TABLE IF NOT EXISTS tb_name (column1 data_type, column2 data_type, column3 data_type)
CLUSTERED BY (column_name) INTO 3 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Example
CREATE TABLE SALES_BUC1 (Transaction_date TIMESTAMP, Product STRING, Price INT, Payment_Type STRING, Name STRING, City STRING, State STRING, Country STRING, Account_Created TIMESTAMP)
CLUSTERED BY (Price) INTO 3 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
35. Loading Data into the Bucketed Table
Syntax
FROM source_tb_name
INSERT OVERWRITE TABLE target_tb_name
SELECT column_name1, column_name2, column_name3;
Example
FROM SALESDETS
INSERT OVERWRITE TABLE SALES_BUC1
SELECT Transaction_date, Product, Price, Payment_Type, Name, City, State, Country, Account_Created;
36. Selecting from a Bucketed Table
Syntax 1
SELECT DISTINCT column_name FROM tb_name
TABLESAMPLE (BUCKET 1 OUT OF 3 ON column_name);
Example
SELECT DISTINCT Price FROM SALES_BUC1
TABLESAMPLE (BUCKET 1 OUT OF 3 ON Price);
Syntax 2
SELECT DISTINCT column_name FROM tb_name
TABLESAMPLE (BUCKET 1 OUT OF 2 ON column_name);
Example
SELECT DISTINCT Price FROM SALES_BUC1
TABLESAMPLE (BUCKET 1 OUT OF 2 ON Price);
37. Sampling
• SAMPLING is used in Hive to build a small dataset from an existing large one. TABLESAMPLE selects the records of a chosen bucket to create the smaller dataset.
Syntax
SELECT COUNT(*) FROM tb_name
TABLESAMPLE (BUCKET 1 OUT OF 3 ON column_name);
Example
In the example below, a sample is drawn from one of the 3 buckets of the table SALES_BUC1.
SELECT COUNT(*) FROM SALES_BUC1
TABLESAMPLE (BUCKET 1 OUT OF 3 ON Price);
38. HBase
• Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable.
• Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
39. NoSQL Databases
• NoSQL – "Not only SQL": non-relational databases
• Schema-less
• Ideology: BASE – Basically Available, Soft state, Eventual consistency
• Per the CAP theorem, a distributed store can guarantee only two of consistency, availability, and partition tolerance
40. NoSQL Types
• Key-value stores – Amazon S3, Riak
• Document stores – CouchDB, MongoDB
• Column-based stores – HBase, Cassandra
• Graph stores – Neo4j, OrientDB
41. HBase Is Not an RDBMS
• Tables have only a single primary key (the row key)
• No join operations
• Limited atomicity and transaction support
• Not manipulated by SQL
42. HBase Components
• Master – manages region assignment and load balancing across region servers
• RegionServer – serves the range of table rows (regions) assigned by the master; uses a MemStore, similar to a cache memory
• ZooKeeper – clients communicate via ZooKeeper to locate region servers for read/write operations; it stores node details and provides synchronization and maintenance services