Hive and data analysis using pandas

Working with Hive to find insights in the datascience.stackexchange.com dataset. Problem: find the top 10 users on datascience.stackexchange.com.

Hadoop (HiveQL) & Data Analysis using Pandas
Purna Chander Rao.Kathula

Agenda
● Introduction to Big Data and Hadoop
● Understanding Hive and its components
● Hive architecture
● Use case: Stack Exchange data (datascience.stackexchange.com)
● Reporting with pandas

Data
Volumes (KB, MB, GB, TB, PB, ...)
● Structured
○ Tabular rows and columns (databases; typically GBs)
○ Data warehouses (Teradata systems) and BI (typically TBs)
● Semi-structured
○ Excel, XML, JSON, logs, etc.
● Unstructured
○ Audio, video, images, etc.

Big Data (problem) -> Hadoop (solution)
● Storage: HDFS
● Processing: MapReduce

Hadoop Architecture
Master-slave architecture

HDFS
Hadoop Distributed File System.
1. Data replication (3 copies by default).
2. 64 MB default block size in Hadoop 1.x (for comparison, a Windows 8 file system typically uses 4 KB blocks).
3. Unix-like commands, each prefixed with a hyphen (-).
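
As a rough worked example of the two numbers above, the sketch below computes how many blocks a file occupies and how much raw cluster storage it consumes. The helper name hdfs_footprint and the 1 GB file size are illustrative assumptions; the 64 MB block size and the replication factor of 3 are the defaults quoted on the slide.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, raw storage in MB) for one file.

    Hypothetical helper for illustration; not part of any Hadoop API.
    """
    # A file is split into fixed-size blocks; the last block may be partly filled.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times, and a partly filled
    # last block only occupies the bytes it actually holds.
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

blocks, raw_mb = hdfs_footprint(1024)  # a 1 GB file
print(blocks, raw_mb)  # 16 3072
```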

Hadoop Services
1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker

Hive
What is Hive?
● Hive is a data warehouse infrastructure built on top of Hadoop that compiles SQL queries into MapReduce jobs.
Hive is not
● A relational database
● Designed for online transaction processing (OLTP)
● A language for real-time queries and row-level updates
Features of Hive
● It stores the schema in a database (the metastore) and the processed data in HDFS.
● It is designed for OLAP.
● It provides a SQL-like query language called HiveQL (HQL).
● It is familiar, fast, scalable, and extensible.

How does Hive work?
● Hive is built on top of Hadoop.
● Hive stores data in HDFS.
● Hive is schema-on-read, not schema-on-write.
● Hive compiles SQL queries into MapReduce jobs and runs them on the Hadoop cluster.
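
To make the last point concrete, here is a toy, single-process illustration of how a query such as SELECT id, SUM(score) ... GROUP BY id breaks down into a map phase, a shuffle, and a reduce phase. It is a conceptual sketch only: the rows, field names, and functions are made up for illustration, and Hive's real execution plan is considerably more involved.

```python
from collections import defaultdict

# Hypothetical input rows: (user id, post score).
rows = [(1, 5), (2, 3), (1, 7), (3, 1), (2, 4)]

def map_phase(records):
    # Emit one (key, value) pair per input row.
    for user_id, score in records:
        yield user_id, score

def shuffle(pairs):
    # Group all values that share a key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Aggregate each key's values; here the aggregate is SUM.
    return {key: sum(values) for key, values in grouped.items()}

totals = reduce_phase(shuffle(map_phase(rows)))
print(totals)  # {1: 12, 2: 7, 3: 1}
```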

Working with XML Files
The example here uses a Stack Exchange dataset: the data from datascience.stackexchange.com.
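
Stack Exchange data dumps store each record as a row element whose fields are XML attributes. The sketch below parses a small made-up sample in that shape using Python's standard library. The sample values are invented, and the attribute names shown (Id, DisplayName, Reputation) should be checked against the actual Users.xml file.

```python
import xml.etree.ElementTree as ET

# Invented sample in the shape of a Stack Exchange Users.xml file.
SAMPLE = """<users>
  <row Id="1" DisplayName="alice" Reputation="101" />
  <row Id="2" DisplayName="bob" Reputation="57" />
</users>"""

def parse_rows(xml_text):
    """Return one dict of attributes per <row> element."""
    root = ET.fromstring(xml_text)
    return [dict(row.attrib) for row in root.iter("row")]

users = parse_rows(SAMPLE)
print(users[0]["DisplayName"])  # alice
```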

Output of Hive MR
Copy the output to a local directory and rename it results.csv. We then load the CSV into pandas for data analysis.

Python Pandas
pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Problem: find the top 10 users on datascience.stackexchange.com.

Contd ..
The loaded data has no header row, so we supply the column names through the names parameter.
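
A minimal sketch of this step follows. The column names id, displayname, and score are taken from the plotting code later in the deck, but their order in the file is an assumption, and an in-memory string stands in for results.csv so the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for results.csv: comma-separated rows with no header line (made-up data).
raw = io.StringIO("1,alice,10\n2,bob,4\n1,alice,6\n")

# header=None tells pandas the first line is data; names supplies the headers.
df = pd.read_csv(raw, header=None, names=["id", "displayname", "score"])
print(list(df.columns))  # ['id', 'displayname', 'score']
print(len(df))           # 3
```

With the real file, pass the path 'results.csv' in place of raw. Hive's default field delimiter is not a comma, so a sep argument may be needed depending on how the output was written.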

Contd ..
Here df is the DataFrame object that holds the entire dataset. Each column can be accessed as a Series.
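
For instance, selecting one column of a DataFrame returns a Series, which carries its own methods. The tiny frame below is made-up data that reuses the column names from the rest of the deck.

```python
import pandas as pd

# Made-up data with the deck's column names.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "displayname": ["alice", "bob", "carol"],
    "score": [16, 4, 9],
})

scores = df["score"]            # a single column is a pandas Series
print(type(scores).__name__)    # Series
print(scores.max())             # 16
```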

Contd ..
Finding the top 10 Scores and their Users
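
One way to compute this, sketched on made-up data: total the score per user, then keep the largest totals. The slide's own approach is not visible in the text, so treat this as an assumption that simply mirrors the groupby used in the plotting code. The helper name top_users is hypothetical.

```python
import pandas as pd

# Made-up data; one row per post with the author's id, name, and post score.
df = pd.DataFrame({
    "id": [1, 2, 1, 3, 2, 3],
    "displayname": ["alice", "bob", "alice", "carol", "bob", "carol"],
    "score": [10, 4, 6, 9, 1, 2],
})

def top_users(frame, n=10):
    """Return the n users with the highest total score, highest first."""
    totals = frame.groupby(["displayname", "id"])["score"].sum()
    return totals.sort_values(ascending=False).head(n)

top = top_users(df, n=2)
print(top)
```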

Plotting Graph (“Bar”)
# IPython/Jupyter magic: render plots inside the notebook
%matplotlib inline
# Sum each user's scores, keep the 10 highest totals, and draw them as a bar chart.
ax = (df.groupby(by=['displayname', 'id'])['score'].sum()
        .sort_values(ascending=False).head(10)
        .plot(kind='bar', figsize=(10, 5), ylim=(50, 500), title="Top 10 users", grid=True, colormap='jet'))
ax.set_xlabel("DisplayNames and UsersID")
ax.set_ylabel("Scores")

Slide outline
1. Hadoop (HiveQL) & Data Analysis using Pandas
2. Agenda
3. Data
4. BigData
5. Big Data (problem) -> Hadoop (solution)
6. Hadoop Architecture
7. HDFS
8. Rack Awareness
9. Hadoop Services
10. Hive
11. How does Hive Work
12. Hive Architecture
13. Create table Hive
14. Storing Schema in MySQL
15. Working with XML Files
16. Create Table Users
17. Create Table Posts
18. Show Tables and load Data
19. Show Tables and load Data (Contd..)
20. Joining 2 Tables (Users and Posts)
21. Output of Hive MR
22. Python Pandas
23. Loading data into iPython Notebook
24. Contd .. (adding headers with the names parameter)
25. Contd .. (the DataFrame and its Series)
26. Contd .. (finding the top 10 scores and their users)
27. Plotting Graph (“Bar”)
28. THANK YOU
