This document discusses analyzing data with Hadoop (HiveQL) and Python pandas. It gives an overview of Hadoop and of Hive's architecture and components, then presents a use case: analyzing Stack Overflow data stored in HDFS with HiveQL and loading the results into pandas for further analysis. Specifically, it shows how to find the top 10 users on the data.stackexchange.com site by total score by joining Hive tables, exporting the result to CSV, and importing it into pandas to plot a bar graph.
2. Agenda
● Introduction to Big Data and Hadoop
● Understanding Hive and its components
● Hive Architecture
● Stack Overflow use case (datascience.stackexchange.com)
● Reporting with Pandas
3. Data
Volumes (KB, MB, GB, TB, PB, ...)
● Structured
○ Tabular rows and columns (databases; typically handle GBs)
○ DWH (Teradata systems) and BI (typically handle TBs)
● Semi-structured
○ Excel, XML, JSON, logs, etc.
● Unstructured
○ Audio, video, images, etc.
7. HDFS
Hadoop Distributed File System.
1. Data replication (3 copies by default).
2. 64 MB block size (for comparison, the file system on a current Windows 8 machine uses 4 KB blocks).
3. Unix-like commands, but with a hyphen (-) before the command, e.g. hadoop fs -ls.
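The effect of the block size and replication factor can be worked through with a little arithmetic. The sketch below is only illustrative: it uses the 64 MB block size and replication factor of 3 from the slide, and the 1000 MB file and the function name hdfs_footprint are assumptions made up for the example.

```python
import math

BLOCK_SIZE_MB = 64   # block size stated on the slide
REPLICATION = 3      # default replication factor stated on the slide


def hdfs_footprint(file_size_mb):
    """Return (blocks, block replicas, raw MB stored) for a single file."""
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored REPLICATION times across the cluster.
    replicas = blocks * REPLICATION
    # A partial last block only occupies its actual size, so raw usage
    # is the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_mb


# Hypothetical 1000 MB file, chosen only for illustration.
blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)  # 16 48 3000
```

With these numbers, a 1000 MB file becomes 16 blocks, stored as 48 block replicas that take up about 3000 MB of raw disk space.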
10. Hive
● What is Hive?
● Hive is a data warehouse infrastructure built on top of Hadoop that compiles SQL queries into MapReduce jobs
Hive is not
● A relational database
● Designed for Online Transaction Processing (OLTP)
● A language for real-time queries and row-level updates
Features of Hive
● It stores the schema in a database and the processed data in HDFS.
● It is designed for OLAP.
● It provides an SQL-like query language called HiveQL (HQL).
● It is familiar, fast, scalable, and extensible.
11. How does Hive work?
● Hive is built on top of Hadoop
● Hive stores data in HDFS
● Hive is schema-on-read, not schema-on-write
● Hive compiles SQL queries into MapReduce jobs and runs them on the Hadoop cluster
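To give a feel for what "compiling a query into a MapReduce job" means, here is a minimal single-process sketch of how an aggregation such as SELECT user_id, SUM(score) ... GROUP BY user_id could be expressed as map, shuffle, and reduce steps. It is not Hive's actual execution plan; the rows, the column names, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (user_id, score) rows standing in for records read from HDFS.
rows = [(1, 5), (2, 3), (1, 7), (3, 1), (2, 4)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for user_id, score in records:
        yield user_id, score


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate the values of each key; here, SUM(score) per user."""
    return {key: sum(values) for key, values in grouped.items()}


totals = reduce_phase(shuffle(map_phase(rows)))
print(totals)  # {1: 12, 2: 7, 3: 1}
```

On a real cluster the map and reduce steps run in parallel on many nodes, and the shuffle moves data between them over the network.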
21. Output of Hive MR
Copy the output to a local directory and rename it results.csv. We then load the CSV into pandas for data analysis.
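Loading the exported file might look like the sketch below. The slides do not show the file's layout, so the comma delimiter, the missing header row, and the column names display_name and total_score are assumptions. An in-memory sample stands in for results.csv so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical sample standing in for results.csv; the real file's columns
# and delimiter depend on how the Hive output was exported.
sample_csv = io.StringIO("Alice,120\nBob,95\nCarol,210\n")

# With the real file, pass the path instead: pd.read_csv("results.csv", ...)
results = pd.read_csv(
    sample_csv,
    header=None,                            # assumed: no header row
    names=["display_name", "total_score"],  # assumed column names
)

print(results.dtypes)
print(results.head())
```

If the Hive output uses a different field separator, the sep argument of pd.read_csv can be set to match.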
22. Python Pandas
pandas is an open-source Python library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Problem: find the top 10 users on data.stackexchange.com.
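In the slides the per-user totals come from Hive, and pandas is used for reporting. As an illustration only, the sketch below redoes the ranking step in pandas on made-up data: it sums scores per user and keeps the ten largest totals. The DataFrame contents and the column names display_name and score are hypothetical.

```python
import pandas as pd

# Hypothetical per-post rows: 12 users with 3 posts each.
posts = pd.DataFrame({
    "display_name": [f"user{i % 12}" for i in range(36)],
    "score": list(range(36)),
})

# Total score per user, then the ten highest totals in descending order.
totals = posts.groupby("display_name")["score"].sum()
top10 = totals.nlargest(10)

print(top10)
```

With matplotlib installed, top10.plot(kind="bar") would draw the bar graph of these ten users.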