Aadhaar dataset analysis using big data Hadoop
● Name: Abhishek Verma
● Submitted to: Eckovation
● Course: Summer internship program in computer science and IT.

TECHNOLOGIES USED
● Cloudera virtual machine running on CentOS using VirtualBox.
● HDFS (Hadoop Distributed File System).
● Linux shell terminal.
● Apache Hive.

Procedure
● Use Hadoop to transfer the .csv file from the local file system into HDFS.
● Enter the Hive shell and create a table.
● Load the data from HDFS into the Hive table.
● Analyze the data in the table using Hive queries.

Using Hadoop to transfer the file from the local file system into HDFS
● The file adhar.csv (CSV stands for comma-separated values) is downloaded from the UIDAI website.
● Commands are run in the terminal of the Cloudera machine.
● The command for copying a file from the local file system into HDFS is: hadoop fs -copyFromLocal <path of the file>. When no destination is given, the file is placed in the user's HDFS home directory (/user/cloudera here). In our case the full command is as follows.
● hadoop fs -copyFromLocal /home/cloudera/Desktop/adhar.csv
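The copy step can also be scripted. The following is a minimal Python sketch, not part of the original project, that only builds the argument list for the command shown above; the helper name `copy_from_local_command` and the explicit destination `/user/cloudera/` are assumptions made for illustration.

```python
import shlex


def copy_from_local_command(local_path, hdfs_dest=None):
    """Build the argument list for `hadoop fs -copyFromLocal`.

    hdfs_dest is optional; when it is omitted the command has the same
    form as the one used on the slide.
    """
    args = ["hadoop", "fs", "-copyFromLocal", local_path]
    if hdfs_dest is not None:
        args.append(hdfs_dest)
    return args


# Same local file as on the slide, with an explicit (assumed) destination.
cmd = copy_from_local_command(
    "/home/cloudera/Desktop/adhar.csv", "/user/cloudera/"
)
print(shlex.join(cmd))

# On a machine with Hadoop installed, the list could be passed to
# subprocess.run(cmd, check=True) to perform the copy.
```

Passing the arguments as a list avoids shell quoting problems if the file path ever contains spaces.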

Entering the Hive shell and creating a table
● The command to enter the Hive shell is hive.
● Once in the Hive shell, a database is required to work in. Hive provides a database named default, but it is recommended to create a new database for a new project.
● The command to create a new database is CREATE DATABASE <database name>; which in our case is CREATE DATABASE project3;
● Switch to the new database with the command USE project3;
● Create a table in Hive with a type for each column, using the command:
CREATE TABLE adhar_dat3 (
  registrar STRING, Enrolment_Agency STRING, State STRING,
  District STRING, Sub_District STRING, Pin_Code STRING,
  Gender STRING, Age STRING,
  Aadhaar_generated INT, Enrolment_Rejected INT,
  Residents_providing_email INT,
  Residents_providing_mobile_number INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
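To make the table definition concrete, here is a rough Python sketch of how one comma-delimited line maps onto the twelve columns. It only approximates Hive's behavior: the sample line is hypothetical (not taken from the UIDAI file), the helper names are made up for illustration, and the sketch simply rejects rows with the wrong number of fields.

```python
def to_int_or_none(text):
    """Approximate an INT column: unparseable text becomes None (NULL)."""
    try:
        return int(text)
    except ValueError:
        return None


# Column names and types copied from the CREATE TABLE statement.
COLUMNS = [
    ("registrar", str), ("Enrolment_Agency", str), ("State", str),
    ("District", str), ("Sub_District", str), ("Pin_Code", str),
    ("Gender", str), ("Age", str),
    ("Aadhaar_generated", to_int_or_none),
    ("Enrolment_Rejected", to_int_or_none),
    ("Residents_providing_email", to_int_or_none),
    ("Residents_providing_mobile_number", to_int_or_none),
]


def parse_row(line):
    """Split on every comma, as FIELDS TERMINATED BY ',' does."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return {name: cast(value) for (name, cast), value in zip(COLUMNS, fields)}


# Hypothetical sample line, for illustration only.
SAMPLE = "Registrar A,Agency X,Delhi,New Delhi,Sub District 1,110001,M,34,1,0,0,1"
row = parse_row(SAMPLE)
print(row["State"], row["Age"], row["Aadhaar_generated"])
```

Two things follow from splitting on every comma: a field that itself contains a comma would shift the later columns, and a header line in the .csv file, if present, would be loaded as an ordinary row whose INT columns are NULL.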

Loading data from HDFS into the Hive table
● Once the table is defined, the data can be loaded from either the local file system or HDFS; in our case HDFS is used.
● The command for loading data from HDFS into the Hive table is: LOAD DATA INPATH '/user/cloudera/adhar.csv' INTO TABLE adhar_dat3;
● Here /user/cloudera/adhar.csv is the file location in HDFS and adhar_dat3 is the name of the Hive table defined earlier.

Performing analysis on the data using queries
● Several queries can be run on the data according to the needs of the user. The queries executed in this case are as follows.
● Number of Aadhaar generated by each state.
● Total number of Aadhaar by gender.
● Average age of an Aadhaar applicant in each state of the country.
● Names of the enrollment agencies that rejected at least one Aadhaar application, along with the number of applications rejected by each agency.
● Maximum age of an applicant from each state whose enrollment was accepted.
● Number of Aadhaar generated by each district.

Number of Aadhaar generated by each state
SELECT State, count(Aadhaar_generated) AS cnt
FROM adhar_dat3 GROUP BY State;
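One point worth checking for this query is whether COUNT or SUM matches the intent. COUNT returns the number of rows with a non-NULL value, while SUM adds the values up. The sketch below uses Python's built-in SQLite as a stand-in for Hive, with hypothetical sample rows; it assumes that each row's Aadhaar_generated can hold a number larger than 1, which should be verified against the real file.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adhar_dat3 (State TEXT, Aadhaar_generated INTEGER)"
)
# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO adhar_dat3 VALUES (?, ?)",
    [("Delhi", 3), ("Delhi", 0), ("Bihar", 2)],
)

# count() counts rows whose value is not NULL, including rows holding 0.
by_count = dict(conn.execute(
    "SELECT State, count(Aadhaar_generated) FROM adhar_dat3 GROUP BY State"
))

# sum() adds the per-row values.
by_sum = dict(conn.execute(
    "SELECT State, sum(Aadhaar_generated) FROM adhar_dat3 GROUP BY State"
))

print("count:", by_count)
print("sum:  ", by_sum)
```

With these sample rows COUNT gives 2 for Delhi while SUM gives 3, so if the column stores per-row totals, SUM may be closer to "number of Aadhaar generated". The same consideration applies to the gender and district queries.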

Total number of Aadhaar by gender
SELECT Gender, count(Aadhaar_generated) AS cnt
FROM adhar_dat3 GROUP BY Gender;

Average age of an Aadhaar applicant in each state of the country
SELECT State, round(avg(Age), 1) AS r1
FROM adhar_dat3 GROUP BY State ORDER BY r1;

Names of the enrollment agencies that rejected at least one Aadhaar application, along with the number of applications rejected by each agency
SELECT Enrolment_Agency, count(Enrolment_Rejected)
FROM adhar_dat3 WHERE (Enrolment_Rejected = 1)
GROUP BY Enrolment_Agency;
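The filter Enrolment_Rejected = 1 keeps only rows whose value is exactly 1. If a row can record more than one rejection, a variant that sums the column and keeps agencies with a total of at least 1 may match the stated goal more closely. The following sketch compares the two on hypothetical rows, again using Python's built-in SQLite in place of Hive; whether the real column ever exceeds 1 is an assumption to check.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adhar_dat3 (Enrolment_Agency TEXT, Enrolment_Rejected INTEGER)"
)
# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO adhar_dat3 VALUES (?, ?)",
    [("Agency X", 1), ("Agency X", 2), ("Agency Y", 0), ("Agency Z", 3)],
)

# Query as written on the slide: only rows with exactly one rejection.
exactly_one = dict(conn.execute(
    "SELECT Enrolment_Agency, count(Enrolment_Rejected) FROM adhar_dat3 "
    "WHERE (Enrolment_Rejected = 1) GROUP BY Enrolment_Agency"
))

# Variant: total rejections per agency, keeping agencies with at least one.
at_least_one = dict(conn.execute(
    "SELECT Enrolment_Agency, sum(Enrolment_Rejected) FROM adhar_dat3 "
    "GROUP BY Enrolment_Agency HAVING sum(Enrolment_Rejected) >= 1"
))

print("exactly one:", exactly_one)
print("at least one:", at_least_one)
```

On this sample the first query reports only Agency X with a count of 1, while the variant reports Agency X and Agency Z with 3 rejections each.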

Maximum age of an applicant from each state whose enrollment was accepted
SELECT State, max(Age) AS cnt
FROM adhar_dat3 WHERE (Enrolment_Rejected = 0)
GROUP BY State;
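Because Age is declared as STRING in the table, max(Age) compares the values as text, so "9" sorts above "85". Casting to an integer inside the aggregate avoids this. The sketch below shows the effect with Python's built-in SQLite standing in for Hive and hypothetical sample rows; in Hive the corresponding expression would be max(CAST(Age AS INT)).

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adhar_dat3 (State TEXT, Age TEXT, Enrolment_Rejected INTEGER)"
)
# Hypothetical rows, for illustration only. Age is stored as text.
conn.executemany(
    "INSERT INTO adhar_dat3 VALUES (?, ?, ?)",
    [("Delhi", "9", 0), ("Delhi", "85", 0), ("Delhi", "70", 1)],
)

# Text comparison: "9" is greater than "85" character by character.
max_as_text = dict(conn.execute(
    "SELECT State, max(Age) FROM adhar_dat3 "
    "WHERE Enrolment_Rejected = 0 GROUP BY State"
))

# Numeric comparison after casting the text to an integer.
max_as_int = dict(conn.execute(
    "SELECT State, max(CAST(Age AS INTEGER)) FROM adhar_dat3 "
    "WHERE Enrolment_Rejected = 0 GROUP BY State"
))

print("as text:", max_as_text)
print("as int: ", max_as_int)
```

Declaring Age as INT when creating the table would be another way to get numeric comparison without a cast in every query.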

Number of Aadhaar generated by each district
SELECT District, count(Aadhaar_generated) AS cnt
FROM adhar_dat3 GROUP BY District;

Conclusion
● Hadoop makes it possible to analyze data sets that would otherwise be impractical to analyze because of their size.
● MapReduce jobs are applied to the data in HDFS to extract the required information from huge data sets or web logs.
● Unlike classic MapReduce programs, which are written in Java and must be packaged into a JAR file before they can run against the data, Hive, Pig, and similar tools are easier to write for because their languages resemble SQL.
● Spark and Impala are emerging technologies that may well replace Hadoop MapReduce, because MapReduce does not offer real-time processing and, according to Apache Spark's own claims, can be up to 100 times slower.

Report on Aadhaar analysis using big data, Hadoop and Hive
