Analyzing HDFS Files using
Apache Spark and
MapReduce
FixedLengthInputFormat
leoricklin@gmail.com
the source code is here
MAPREDUCE-1176: FixedLengthInputFormat and
FixedLengthRecordReader (fixed in 2.3.0)
by Mariappan Asokan, BitsOfInfo
Addition of FixedLengthInputFormat and FixedLengthRecordReader in the
org.apache.hadoop.mapreduce.lib.input package. These two classes can be used
when you need to read data from files containing fixed length (fixed width) records.
Such files have no CR/LF (or any combination thereof), no delimiters etc, but each
record is a fixed length, and extra data is padded with spaces.
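To make the idea of fixed-width records concrete, here is a minimal plain-Scala sketch. It does not use the Hadoop classes; `FixedWidthSketch` and `records` are hypothetical names chosen for illustration, and the sketch assumes ASCII data whose total size is an exact multiple of the record length.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper for illustration; this is not the Hadoop FixedLengthRecordReader.
object FixedWidthSketch {
  /** Cuts `data` into records of exactly `recordLength` bytes and strips trailing space padding. */
  def records(data: Array[Byte], recordLength: Int): List[String] = {
    require(recordLength > 0, "recordLength must be positive")
    require(data.length % recordLength == 0, "data must contain whole records only")
    data.grouped(recordLength)
      .map(bytes => new String(bytes, StandardCharsets.US_ASCII).replaceAll(" +$", ""))
      .toList
  }
}
```

With a record length of 5, the 15 bytes `"AB   CDE  F    "` are read as three records with no delimiter between them; only the position in the file decides where a record starts.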
One 2GB gigantic line within a file issue
[stackoverflow] Considering the String class' length method returns an int, the
maximum length that would be returned by the method would be
Integer.MAX_VALUE, which is 2^31 - 1 (or approximately 2 billion.)
In terms of lengths and indexing of arrays, (such as char[], which is probably the
way the internal data representation is implemented for Strings),...
val rdd = sc.textFile("hdfs:///user/leo/test.txt/nolr2G-1.txt")
rdd.count
...
ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 236)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
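The limit quoted above can be checked with a few lines of plain Scala. This is only an arithmetic sketch: `LineLengthLimit` and `fitsInString` are hypothetical names, and the 2^31 length is the line size used in the validation later in this deck.

```scala
// Hypothetical names for illustration; this only checks the arithmetic of the limit.
object LineLengthLimit {
  val maxStringLength: Int = Int.MaxValue   // 2^31 - 1, the largest value String.length can return
  val giantLineLength: Long = 1L << 31      // 2^31, must be held in a Long

  /** True when a line of `length` characters could be represented as one JVM String. */
  def fitsInString(length: Long): Boolean = length <= Int.MaxValue
}
```

A line of 2^31 characters is one character past the limit, so it can never be returned as a single String record. In practice the executor can run out of heap well before that point, as the stack trace above shows.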
hdfs file
org.apache.hadoop.mapreduce.lib.input.TextInputFormat:
An InputFormat for plain text files. Files are broken into lines. Either linefeed or
carriage-return are used to signal end of line. Keys are the position in the file, and
values are the line of text.
$ hdfs dfs -cat example.txt
King Henry the Fourth.
Henry, Prince of Wales, son to the King.
Prince John of Lancaster, son to the King.
Earl of Westmoreland.
Sir Walter Blunt.
Thomas Percy, Earl of Worcester.
Henry Percy, Earl of Northumberland.
Henry Percy, surnamed Hotspur, his son.
...
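As a rough illustration of the key/value pairs described above, the following plain-Scala sketch pairs each line with the byte offset at which it starts. It is not the Hadoop implementation: `OffsetLines` is a hypothetical name, and the sketch assumes lines end with `\n` only.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical sketch of "key = position in the file, value = line of text".
object OffsetLines {
  /** Returns (byte offset of the line start, line text) pairs, cutting at '\n' only. */
  def records(text: String): List[(Long, String)] = {
    val bytes = text.getBytes(StandardCharsets.UTF_8)
    val out = List.newBuilder[(Long, String)]
    var start = 0
    for (i <- bytes.indices if bytes(i) == '\n'.toByte) {
      out += ((start.toLong, new String(bytes, start, i - start, StandardCharsets.UTF_8)))
      start = i + 1
    }
    // A final line without a trailing newline is still a record.
    if (start < bytes.length)
      out += ((start.toLong, new String(bytes, start, bytes.length - start, StandardCharsets.UTF_8)))
    out.result()
  }
}
```

For the first two lines of example.txt, the keys are 0 and 23: the first line is 22 bytes long and its newline occupies offset 22.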
[Figure: the file's lines (line 1, line 2, ..., line n) laid out across HDFS blocks (block 1, block 2, ..., block n)]
Split & Record
❏ An input split is a chunk of the input that is processed by a single map.
❏ Each split is divided into records, and the map processes each record (a key-value pair) in turn.
❏ By default the split size is dfs.block.size.[1]
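The relationship between file size, split size and the number of splits can be sketched as follows. This is a simplified model under stated assumptions: `SplitRanges` and `Split` are hypothetical names, and the sketch ignores the extra rules Hadoop applies when it computes splits (for example minimum and maximum split size settings).

```scala
// Hypothetical, simplified model of cutting a file into splits of one block each.
object SplitRanges {
  final case class Split(start: Long, length: Long)

  /** Cuts [0, fileSize) into consecutive ranges of at most `splitSize` bytes. */
  def splits(fileSize: Long, splitSize: Long): List[Split] = {
    require(fileSize >= 0, "fileSize must not be negative")
    require(splitSize > 0, "splitSize must be positive")
    (0L until fileSize by splitSize)
      .map(start => Split(start, math.min(splitSize, fileSize - start)))
      .toList
  }
}
```

The split boundaries depend only on byte positions, not on where lines end, which is why a record can begin in one split and finish in another.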
[Figure: two HDFS text files compared. Normal text file: Split 1 (block 1) and Split 2 (block 2) hold Record 1 (line 1) through Record 6 (line 6), each ending in CR/LF. One 2GB gigantic line within a file: a single record > 2GB stretches across many splits.]
[1] Tom White, “Hadoop: The Definitive Guide, 3rd Edition”, p.234, 2012
Check length of records with FixedLengthInputFormat (1)
[Figure: an HDFS file whose records end in CR/LF, stored as a sequence of blocks (blk) and read as fixed-length blocks grouped into Split-1 and Split-2; record lengths (len) are computed per block, then per split, then per file.]
① Read blocks of fixed length except the last one
② Split blocks by \n and compute length of each record
③ Combine last part of previous block with first part of current block
④ Group splits by each file
⑤ Sort splits by reading position
⑥ Combine last part of previous block with first part of next block
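Steps ① to ③ can be sketched in plain Scala as follows. This is one possible reading of the slide, not the author's source code: `BlockRecordLengths`, `pieceLengths` and `recordLengths` are hypothetical names, records are assumed to end with `\n`, and the input is modeled as an in-memory byte iterator rather than an HDFS stream.

```scala
// Hypothetical sketch of steps 1-3: measure record lengths without building the records.
object BlockRecordLengths {
  /** Step 2: lengths of the pieces of one block cut at every '\n' (k newlines give k + 1 pieces). */
  def pieceLengths(block: Array[Byte]): List[Long] = {
    val pieces = List.newBuilder[Long]
    var current = 0L
    block.foreach { b =>
      if (b == '\n'.toByte) { pieces += current; current = 0L } else current += 1
    }
    pieces += current // bytes after the last newline (possibly 0)
    pieces.result()
  }

  /** Steps 1 and 3: read fixed-size blocks and join the piece that straddles two blocks. */
  def recordLengths(data: Iterator[Byte], blockSize: Int): List[Long] = {
    val lengths = List.newBuilder[Long]
    var carry = 0L // length of the record still open at the end of the previous block
    data.grouped(blockSize).foreach { block =>
      pieceLengths(block.toArray) match {
        case only :: Nil => carry += only // no newline in this block
        case first :: rest =>
          lengths += carry + first        // closes the record carried over
          lengths ++= rest.init           // records fully inside the block
          carry = rest.last               // record left open at the block end
        case Nil => ()                    // cannot happen: there is always one piece
      }
    }
    if (carry > 0) lengths += carry       // last record without a trailing newline
    lengths.result()
  }
}
```

Because only lengths are accumulated, and in a `Long`, the sketch never needs to hold a whole record in memory, and the result does not depend on the block size chosen.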
Check length of records with FixedLengthInputFormat (2)
[Figure: an HDFS file holding a single record > 2GB that stretches across many splits; the file is read as fixed-length blocks (blk) grouped into Split-1 and Split-2, and the partial lengths (len) are combined into one record length for the file.]
① Read blocks of fixed length except the last one
② Split blocks by \n and compute length of each record
③ Combine last part of previous block with first part of next block
④ Group splits by file
⑤ Sort splits by reading position
⑥ Combine last part of previous block with first part of next block
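Steps ④ to ⑥ can be sketched in the same spirit. Again this is an assumption about how the steps fit together, not the author's code: `SplitMerge` and `SplitSummary` are hypothetical names, and each split is assumed to report the lengths of its pieces after cutting at `\n`, so the first and last pieces may be partial records.

```scala
// Hypothetical sketch of steps 4-6: stitch per-split piece lengths into per-file record lengths.
object SplitMerge {
  /** Piece lengths of one split; the first and last pieces may be partial records. */
  final case class SplitSummary(file: String, start: Long, pieces: List[Long])

  def recordLengthsPerFile(summaries: Seq[SplitSummary]): Map[String, List[Long]] =
    summaries.groupBy(_.file).map { case (file, parts) => // step 4: group by file
      val lengths = List.newBuilder[Long]
      var carry = 0L
      parts.sortBy(_.start).foreach { split =>            // step 5: sort by position
        split.pieces match {                              // step 6: join across splits
          case Nil => ()
          case only :: Nil => carry += only               // split contains no newline
          case first :: rest =>
            lengths += carry + first
            lengths ++= rest.init
            carry = rest.last
        }
      }
      if (carry > 0) lengths += carry
      file -> lengths.result()
    }
}
```

For a file made of two splits of 2^30 bytes each with no newline, the partial lengths add up to a single record of 2147483648 bytes, a value that only fits in a `Long`.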
Validation (1)
$ hdfs dfs -ls /user/leo/test/
-rw-r--r-- 1 leo leo 2147483669 2015-10-06 07:40 /user/leo/test/nolr2G-1.txt
-rw-r--r-- 1 leo leo 2147483669 2015-10-06 09:19 /user/leo/test/nolr2G-2.txt
-rw-r--r-- 1 leo leo 2147483669 2015-10-07 00:53 /user/leo/test/nolr2G-3.txt
$ hdfs dfs -cat /user/leo/test/nolr2G-1.txt
01234567890123456789...........0123456789
0123456789
scala> recordLenOfFile.map{ case (path, stat) => f"[${path}][${stat.toString()}]"}.collect().foreach(println)
...
INFO TaskSetManager: Finished task 47.0 in stage 9.0 (TID 223) in 16 ms on localhost (48/48)
...
[hdfs://sandbox.hortonworks.com:8020/user/leo/test/nolr2G-1.txt][stats: (count: 1, mean: 2147483648.000000, stdev: 0.000000, max: 2147483648.000000, min: 2147483648.000000), NaN: 0]
[hdfs://sandbox.hortonworks.com:8020/user/leo/test/nolr2G-2.txt][stats: (count: 1, mean: 2147483648.000000, stdev: 0.000000, max: 2147483648.000000, min: 2147483648.000000), NaN: 0]
[hdfs://sandbox.hortonworks.com:8020/user/leo/test/nolr2G-3.txt][stats: (count: 1, mean: 2147483648.000000, stdev: 0.000000, max: 2147483648.000000, min: 2147483648.000000), NaN: 0]
File size: 2 GB + 10 bytes + "\n" + 10 bytes
The output shows the number of lines and the line-length statistics for each file.
Here we found one line of 2147483648 characters in each file.
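The statistics in the output (count, mean, stdev, max, min) can be reproduced from a list of record lengths with a short sketch. `LengthStats` and `Stats` are hypothetical names, not the class behind `recordLenOfFile`, and the sketch assumes the population standard deviation.

```scala
// Hypothetical sketch of the summary statistics; assumes population standard deviation.
object LengthStats {
  final case class Stats(count: Long, mean: Double, stdev: Double, max: Double, min: Double)

  def of(lengths: Seq[Long]): Stats = {
    require(lengths.nonEmpty, "at least one record length is needed")
    val n = lengths.size
    val mean = lengths.map(_.toDouble).sum / n
    val variance = lengths.map(l => math.pow(l - mean, 2)).sum / n
    Stats(n, mean, math.sqrt(variance), lengths.max.toDouble, lengths.min.toDouble)
  }
}
```

For a single length of 2147483648 the mean, max and min all equal that value and the standard deviation is 0, which matches the shape of the output above.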
Validation (2)
$ hdfs dfs -ls /user/leo/test.2/
-rw-r--r-- 1 leo leo 5258688 2015-10-07 06:56 test.2/all-bible-1.txt
-rw-r--r-- 1 leo leo 5258688 2015-10-07 06:56 test.2/all-bible-2.txt
-rwxr-xr-x 1 leo leo 5258688 2015-10-06 02:12 test.2/all-bible-3.txt
$ hdfs dfs -cat test.2/all-bible-1.txt|wc -l
117154
scala> recordLenOfFile.map{ case (path, stat) => f"[${path}][${stat.toString()}]"}.collect().foreach(println)
...
INFO TaskSetManager: Finished task 0.0 in stage 13.0 (TID 233) in 115 ms on localhost (3/3)
...
[hdfs://sandbox.hortonworks.com:8020/user/leo/test.2/all-bible-2.txt][stats: (count: 116854, mean: 43.866937, stdev: 30.647162, max: 88.000000, min: 1.000000), NaN: 0]
[hdfs://sandbox.hortonworks.com:8020/user/leo/test.2/all-bible-3.txt][stats: (count: 116854, mean: 43.866937, stdev: 30.647162, max: 88.000000, min: 1.000000), NaN: 0]
[hdfs://sandbox.hortonworks.com:8020/user/leo/test.2/all-bible-1.txt][stats: (count: 116854, mean: 43.866937, stdev: 30.647162, max: 88.000000, min: 1.000000), NaN: 0]
