Hive and data analysis using pandas

This document discusses analyzing data with Hadoop (HiveQL) and Python pandas. It gives an overview of the Hadoop and Hive architecture and components, then presents a use case: analyzing Stack Exchange data stored in HDFS with HiveQL and loading the results into pandas for further analysis. Specifically, it shows how to find the top 10 users of datascience.stackexchange.com by total score by joining Hive tables, exporting the result to CSV, and importing it into pandas to plot a bar chart.

Hadoop (HiveQL) & Data Analysis using Pandas
Purna Chander Rao.Kathula

Agenda
● Introduction to Big Data and Hadoop
● Understanding Hive and its components
● Hive architecture
● Use case: Stack Exchange data (datascience.stackexchange.com)
● Reporting with pandas

Data
Volumes (KB, MB, GB, TB, PB, ...)
● Structured
○ Tabular rows and columns (databases; supports gigabytes)
○ DWH (Teradata systems) and BI (supports terabytes)
● Semi-structured
○ Excel, XML, JSON, logs, etc.
● Unstructured
○ Audio, video, images, etc.

Big Data (problem) ---> Hadoop (solution)
● Storage: HDFS
● Processing: MapReduce

Hadoop Architecture
Master-slave architecture

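HDFS
Hadoop Distributed File System
1. Data replication (each block is stored 3 times by default).
2. 64 MB block size, the default in Hadoop 1.x (later versions default to 128 MB). For comparison, a Windows 8 file system uses 4 KB blocks.
3. Unix-like commands, but with a hyphen before the command (for example, hadoop fs -ls).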
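The replication factor and block size above determine how a file is laid out. The following is a minimal sketch of that arithmetic, assuming the 64 MB block size and 3x replication quoted on the slide; the function name hdfs_footprint and the 200 MB file are hypothetical and used only for illustration.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file.

    Illustrative arithmetic only; it does not talk to a real cluster.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies its actual size, so raw storage is
    # the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(200)
print(blocks, replicas, raw_mb)
```

Under these assumptions a 200 MB file is split into 4 blocks (three full ones and a partial one), and the cluster holds 12 block replicas in total.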

Hadoop Services
1. NameNode
2. DataNode
3. Secondary NameNode
4. JobTracker
5. TaskTracker

Hive
What is Hive?
● Hive is a data warehouse infrastructure built on top of Hadoop that compiles SQL queries into MapReduce jobs.
Hive is not
● A relational database
● A design for online transaction processing (OLTP)
● A language for real-time queries and row-level updates
Features of Hive
● It stores the schema in a database and the processed data in HDFS.
● It is designed for OLAP.
● It provides a SQL-like query language called HiveQL or HQL.
● It is familiar, fast, scalable, and extensible.
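Because the schema lives in a separate database while the data stays as plain files in HDFS, the schema is only applied at the moment the files are read. The sketch below illustrates that idea in plain Python. It is not Hive code: the sample rows, the column list, and the read_with_schema helper are all hypothetical, and turning unparseable values into None is meant only to resemble how Hive reports NULL for fields that do not fit the declared type.

```python
import io

# Hypothetical raw file contents: plain delimited text with no schema inside.
raw_file = io.StringIO("1,alice,120\n2,bob,not_a_number\n")

# Hypothetical schema, kept apart from the data files.
schema = [("id", int), ("displayname", str), ("score", int)]


def read_with_schema(lines, schema):
    """Apply the schema while reading; values that do not fit become None."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None
        yield row


rows = list(read_with_schema(raw_file, schema))
print(rows)
```

Nothing rejected the malformed second row when it was written; the mismatch only surfaces when the schema is applied during the read.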

How Does Hive Work?
● Hive is built on top of Hadoop.
● Hive stores data in HDFS.
● Hive is schema-on-read, not schema-on-write.
● Hive compiles SQL queries into MapReduce jobs and runs them on the Hadoop cluster.
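To make the last point concrete, here is a conceptual sketch of how a grouped aggregation (something like a SUM of scores per display name) can be expressed as map, shuffle, and reduce steps. It is plain Python over hypothetical in-memory records, not the plan Hive actually generates, and the function names are illustrative.

```python
from collections import defaultdict

# Hypothetical (displayname, score) records standing in for table rows.
records = [("alice", 5), ("bob", 3), ("alice", 7), ("bob", 1), ("carol", 4)]


def map_phase(rows):
    """Map: emit one (key, value) pair per input row."""
    for name, score in rows:
        yield name, score


def shuffle_phase(pairs):
    """Shuffle: bring together all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single result (a sum)."""
    return {key: sum(values) for key, values in grouped.items()}


totals = reduce_phase(shuffle_phase(map_phase(records)))
print(totals)
```

On a real cluster the map and reduce steps run in parallel on different nodes, and the shuffle moves data between them over the network.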

Working with XML Files
The example here uses the Stack Exchange dataset for datascience.stackexchange.com.
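For orientation, the sketch below shows what reading such an XML file can look like in Python. It assumes the usual shape of a Stack Exchange data dump, where each record is a row element with its fields stored as attributes; the two sample rows and the parse_users helper are hypothetical, and the slides themselves load the XML through Hive rather than Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed shape of a Stack Exchange users file.
sample = """<users>
  <row Id="1" DisplayName="alice" Reputation="101" />
  <row Id="2" DisplayName="bob" Reputation="55" />
</users>"""


def parse_users(xml_text):
    """Turn each <row> element into a dict of the fields used later."""
    root = ET.fromstring(xml_text)
    return [
        {"id": int(row.attrib["Id"]), "displayname": row.attrib["DisplayName"]}
        for row in root.iter("row")
    ]


users = parse_users(sample)
print(users)
```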

Output of Hive MR
Copy the output to a local directory and rename it results.csv. We then load the CSV into pandas for data analysis.
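Renaming the file is enough only if the Hive output is already comma-delimited. Hive's default field delimiter for text output is the Ctrl-A character (\x01), so a small conversion step may be needed first. The sketch below assumes that default was left in place; the helper name hive_output_to_csv and the sample rows are hypothetical.

```python
import csv
import io

HIVE_DELIMITER = "\x01"  # Ctrl-A, assumed to be the delimiter in the output file


def hive_output_to_csv(src, dst, delimiter=HIVE_DELIMITER):
    """Rewrite delimiter-separated lines from src as CSV rows in dst."""
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for line in src:
        writer.writerow(line.rstrip("\n").split(delimiter))
        count += 1
    return count


# In-memory stand-ins for the Hive output file and results.csv.
part_file = io.StringIO("alice\x011\x01120\nbob\x012\x0185\n")
results_csv = io.StringIO()
n = hive_output_to_csv(part_file, results_csv)
print(results_csv.getvalue())
```

With real files, the two StringIO objects would be replaced by open() calls on the copied output file and on results.csv.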

Python Pandas
pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Problem: find the top 10 users on datascience.stackexchange.com.

Contd.
The loaded data does not have a header row, so we supply the column names through the names parameter.
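A minimal sketch of that load follows. The column names displayname, id, and score are taken from the plotting code later in the deck, but their order in the file is an assumption, and the three sample rows are hypothetical stand-ins for results.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for results.csv: no header row, three columns.
results_csv = io.StringIO("alice,1,120\nbob,2,85\nalice,1,30\n")

# header=None tells pandas the first line is data, not column names;
# names supplies the column labels instead.
df = pd.read_csv(results_csv, header=None, names=["displayname", "id", "score"])
print(df.head())
```

With the real file, the StringIO object would be replaced by the path to results.csv.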

Contd.
Here df is the DataFrame object that holds the entire dataset. Each column can be accessed and manipulated as a Series.
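As a small illustration of the DataFrame and Series relationship, the sketch below builds a hypothetical three-row frame with the same column names and works with the score column on its own.

```python
import pandas as pd

# Hypothetical data with the same column names as the loaded results.
df = pd.DataFrame(
    {
        "displayname": ["alice", "bob", "carol"],
        "id": [1, 2, 3],
        "score": [120, 85, 40],
    }
)

# Selecting one column returns a Series.
scores = df["score"]
print(type(scores).__name__)
print(scores.max(), scores.mean())

# A Series comparison yields a boolean mask that can filter the whole frame.
high = df[scores > 50]
print(high)
```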

Contd.
Finding the top 10 scores and their users
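One way to compute this ranking is sketched below, using the same groupby, sum, and sort_values chain as the plotting code in the deck. The data is hypothetical, and head(2) stands in for head(10) because the sample has only three users.

```python
import pandas as pd

# Hypothetical per-post scores; a user can appear on several rows.
df = pd.DataFrame(
    {
        "displayname": ["alice", "bob", "alice", "carol", "carol"],
        "id": [1, 2, 1, 3, 3],
        "score": [120, 85, 30, 40, 50],
    }
)

# Total score per user, highest first.
totals = df.groupby(["displayname", "id"])["score"].sum()
top = totals.sort_values(ascending=False).head(2)
print(top)
```

Grouping on both displayname and id keeps two users who share a display name from being merged into one total.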

Plotting Graph (“Bar”)
%matplotlib inline
# Sum the scores per user, keep the 10 highest totals, and draw them as a bar chart.
top10 = df.groupby(by=['displayname', 'id'])['score'].sum().sort_values(ascending=False).head(10)
ax = top10.plot(kind='bar', figsize=(10, 5), ylim=(50, 500), title="Top 10 users", grid=True, colormap='jet')
ax.set_xlabel("Display names and user IDs")
ax.set_ylabel("Scores")


