3 Approaches To Big Data Analysis With Apache Hadoop

The Apache™ Hadoop® ecosystem provides a rich source of utilities that are key to helping enterprises unlock vital insights from large data sets. Discover how three powerful data analysis tools match up.

By Dave Jaffe

Log files from web servers represent a treasure trove of data that enterprises can mine to gain a deep understanding of customer shopping habits, social media use, web advertisement effectiveness and other metrics that inform business decisions. Each click from a web page can create on the order of 100 bytes of data in a typical website log. Consequently, large websites handling millions of simultaneous visitors can generate hundreds of gigabytes or even terabytes of logs per day. Ferreting out nuggets of valuable information from this mass of data can require very sophisticated algorithms.

To analyze big data, many organizations turn to open-source utilities found in the Apache Hadoop ecosystem. The choice of a particular tool depends on the needs of the analysis, the skill set of the data analyst, and the trade-off between development time and execution time. Three commonly used tools for analyzing data resident in Apache HDFS™ (Hadoop Distributed File System) include the Hadoop MapReduce framework, Apache Hive™ data warehousing software and the Apache Pig™ platform. MapReduce requires a computer program — often written in the Oracle® Java® programming language — to read, analyze and output data. Hive provides a SQL-like front end well suited for analysts with a database background, who view data in terms of tables and joins. And the Pig platform includes a high-level language for data processing that enables the analyst to exploit the parallelism inherent in a Hadoop cluster.

Understanding website visitors through log file analysis

To compare the performance of the three tools, Dell engineers created a program for each tool that tackled a simple log file analysis task: measuring the amount of traffic coming to the website by country of origin on an hour-by-hour basis during an average day. (For more information, see the sidebar, "Configuration details.") The test analyzed files in the standard Apache HTTP Server log format (see the figure below).

Figure: Components in a standard Apache web log file (remote host, date/time stamp, request line, status/size/referrer and user agent). For example:
172.16.3.1 - - [27/Jun/2012:17:48:34 -0500] "GET /favicon.ico HTTP/1.1" 404 298 "http://110.240.0.17" "Mozilla/5.0 …"

The first component of the log file is the remote IP address, which the programs used to determine the host country. The programs parsed only the first two octets of the IP address and looked them up in a table derived from GeoLite data created by MaxMind. The table, contained in a space-separated file all_classbs.txt, listed the Class B addresses used exclusively by a single country along with the country code. The hour of the visit can be extracted from the time stamp, which is the second component of the log file.
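To make the parsing step concrete, the following minimal Java sketch shows one way to pull the first two octets of the remote IP address and the hour of access out of a log line in this format. The class name and regular expression are illustrative assumptions, not the code the Dell team used; the published listings are available at github.com/DaveJaffe/BigDataDemos.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: extracts the first two octets of the remote host IP
// address (the Class B prefix) and the hour of access from an Apache log line.
public class LogLineParser {

    // Captures the Class B prefix of the remote host and the hour field of the
    // time stamp, e.g. "172.16.3.1 - - [27/Jun/2012:17:48:34 -0500] ..."
    private static final Pattern LOG_PATTERN = Pattern.compile(
            "^(\\d{1,3}\\.\\d{1,3})\\.\\d{1,3}\\.\\d{1,3} \\S+ \\S+ \\[[^:]+:(\\d{2}):");

    public static void main(String[] args) {
        String line = "172.16.3.1 - - [27/Jun/2012:17:48:34 -0500] "
                + "\"GET /favicon.ico HTTP/1.1\" 404 298 \"http://110.240.0.17\" \"Mozilla/5.0\"";
        Matcher m = LOG_PATTERN.matcher(line);
        if (m.find()) {
            String classB = m.group(1); // first two octets, e.g. "172.16"
            String hour = m.group(2);   // hour of the visit, e.g. "17"
            // In the actual programs, classB is looked up in the table built from
            // all_classbs.txt to obtain the country code.
            System.out.println(classB + " visited at hour " + hour);
        }
    }
}

Each of the three programs described below performs essentially this extraction before counting hits by country and hour.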
The data used in the test was generated by a MapReduce program, the GeoWeb Apache Log Generator.[1] The program was designed to produce realistic sequential Apache web logs for a specified month, day, year and number of clicks per day. Remote hosts were distributed geographically among the top 20 internet-using countries[2] and temporally so that each region was most active during its local evening hours, simulating a consumer or social website. Since the synthetic web logs created for the test represent just the top 20 internet-using countries, the output of each program consisted of 480 keys (20 countries over 24 hours), each associated with a value representing the total number of hits from that country during that hour.

MapReduce: Parallel processing for large data sets

The MapReduce framework provides a flexible, resilient and efficient mechanism for distributed computing over large server clusters. MapReduce coordinates distributed servers to run various tasks in parallel: map tasks that read data stored in HDFS and emit key-value pairs; combiner tasks that aggregate the values for each key being emitted by a mapper; and reducer tasks that process the values for each key. Writing a MapReduce program is a direct way to exploit the capabilities of this framework for data manipulation and analysis.

Designing the MapReduce program to perform the geographical web analysis was fairly straightforward.[3] The mapper read log files from HDFS one line at a time and parsed the first two octets of the remote IP address as well as the hour of web access. It then looked up the country corresponding to that IP address from a table generated from the all_classbs.txt file and emitted a key — with a value of 1 — comprising the country code and hour. The combiner and reducer added all the values per key and wrote 24 keys per detected country to HDFS, each with a value corresponding to the total number of hits coming from that country in that hour across the whole set of log files.
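The following abbreviated Java sketch shows the general shape of that map and reduce logic using the org.apache.hadoop.mapreduce API. The class names, the hard-coded country lookup and the field parsing are simplifications assumed for illustration; they are not the GeoWeb.java, GeoWebMapper.java or SumReducer.java listings published on GitHub.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Simplified sketch of the map and reduce steps described above. The country
// lookup is a placeholder; the real programs build it from all_classbs.txt.
public class GeoWebSketch {

    public static class CountryHourMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(" ");
            if (fields.length < 4) {
                return;                                  // skip malformed lines
            }
            String[] octets = fields[0].split("\\.");
            String classB = octets[0] + "." + octets[1]; // first two octets
            String hour = fields[3].substring(13, 15);   // hour from "[27/Jun/2012:17:48:34"
            String country = lookupCountry(classB);      // from all_classbs.txt in the real code
            if (country != null) {
                outKey.set(country + "\t" + hour);
                context.write(outKey, ONE);              // e.g. ("US\t17", 1)
            }
        }

        private String lookupCountry(String classB) {
            // Placeholder lookup; the real mapper loads the full Class B table.
            return "172.16".equals(classB) ? "US" : null;
        }
    }

    // Used as both combiner and reducer: sums the counts for each (country, hour) key.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Because summing counts is associative, the same class can serve as both combiner and reducer, which is how the SumReducer described in the article is used.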
Hive: Data warehouse infrastructure for ad hoc queries

The Apache Hive tool projects structure onto data stored in HDFS and also provides a SQL interface, HiveQL (HQL), to query that data. Hive creates a query plan that implements HQL in a series of MapReduce programs, generates the code for these programs and then executes the code.

The Hive program first defined the HDFS data in terms of SQL tables.[4] Both the web logs and the mapping information contained in all_classbs.txt were turned into HQL tables. Once the data tables were defined, the program read the web log data and parsed it using a serializer-deserializer provided by Hive, RegexSerDe. The results were grouped by country code and hour, and the count of each combination was generated. Additional formatting was performed so that the output would resemble that of the MapReduce program.
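As a rough illustration of that flow, the following HQL sketch shows how web logs parsed with RegexSerDe might be joined to the Class B table and grouped by country code and hour. The table names, column definitions and regular expression are assumptions for illustration and do not reproduce the geoweb.q program.

-- Illustrative HQL sketch; table names, columns and the regex are assumptions.
-- Web log table parsed with RegexSerDe: the remote IP is reduced to its first
-- two octets (class_b) and the hour is pulled from the time stamp.
CREATE EXTERNAL TABLE web_logs (class_b STRING, hour STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "^(\\d{1,3}\\.\\d{1,3})\\.\\d{1,3}\\.\\d{1,3} \\S+ \\S+ \\[[^:]+:(\\d{2}):.*"
)
LOCATION '/data/weblogs';

-- Class B-to-country mapping loaded from the space-separated all_classbs.txt file.
CREATE EXTERNAL TABLE class_b_countries (class_b STRING, country_code STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/data/all_classbs';

-- Hits per country per hour; Hive turns this query into a series of MapReduce jobs.
SELECT c.country_code, w.hour, COUNT(*) AS hits
FROM web_logs w
JOIN class_b_countries c ON (w.class_b = c.class_b)
GROUP BY c.country_code, w.hour
ORDER BY c.country_code, w.hour;

Hive compiles the join, group and order steps of a query like this into a chain of MapReduce jobs, which is the overhead measured in the performance comparison below.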
Pig: High-level data flow framework for parallel computation

Apache Pig provides a data flow language, Pig Latin, that enables the user to specify reads, joins and other computations without the need to write a MapReduce program. Like Hive, Pig generates a sequence of MapReduce programs to implement the data analysis steps.

The Pig program loaded the web logs and Class B IP address data from all_classbs.txt into two Pig data bags, or relations.[5] Then the program parsed the web logs to extract the first two octets of the IP address and the hour from the time stamp. The two items, or tuple, were saved in a data bag and then joined with the IP address information, which was subsequently stored in a different data bag. That data bag was then grouped by the (country code, hour) tuple, and the entries in each group were counted. The result was ordered by country code and hour and then stored back in HDFS. Pig compiled the data flow into MapReduce, resulting in a multi-pass MapReduce program. For each of the key steps — JOIN, GROUP, ORDER — a PARALLEL option may be specified as a hint to MapReduce to determine the number of map or reduce tasks to deploy for that step. Finally, the Dell team formatted the output to yield a result identical to that obtained from the MapReduce and Hive programs.
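The following Pig Latin sketch outlines a comparable data flow. The relation names, paths and regular expression are assumptions for illustration and do not reproduce the geoweb.pig program; the PARALLEL clauses mark the steps where the hint described above can be applied.

-- Illustrative Pig Latin sketch; relation names, paths and the regex are assumed.
logs = LOAD '/data/weblogs' USING TextLoader() AS (line:chararray);
classbs = LOAD '/data/all_classbs' USING PigStorage(' ') AS (class_b:chararray, country:chararray);

-- Extract the Class B prefix of the remote IP and the hour from the time stamp.
parsed = FOREACH logs GENERATE FLATTEN(
    REGEX_EXTRACT_ALL(line, '^(\\d+\\.\\d+)\\.\\d+\\.\\d+ \\S+ \\S+ \\[[^:]+:(\\d+):.*'))
    AS (class_b:chararray, hour:chararray);
matched = FILTER parsed BY class_b IS NOT NULL;

-- Join with the Class B table, then count hits for each (country, hour) group.
joined = JOIN matched BY class_b, classbs BY class_b PARALLEL 20;
grouped = GROUP joined BY (classbs::country, matched::hour) PARALLEL 20;
counts = FOREACH grouped GENERATE FLATTEN(group) AS (country, hour), COUNT(joined) AS hits;
ordered = ORDER counts BY country, hour PARALLEL 20;
STORE ordered INTO '/data/hits_by_country_hour';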

Comparison of program performance

The Dell team ran the MapReduce, Hive and Pig programs sequentially against the 1 TB set of log files. The total number of hits per country over the year and over the course of a 24-hour period was calculated. These percentages matched the input distribution, indicating that the parsing and processing of the IP address table and time stamp information worked properly for all three programs. Because the same IP address data was used to generate as well as analyze the remote IP addresses, 100 percent of the log entries successfully matched a country in this test. In a real-world scenario, however, the percentage of matches is expected to be lower.

The team then ran the three programs against the 10 TB set of log files and compared the performance to that of the 1 TB set (see the table below).

Comparison of program performance, in terms of elapsed time to analyze log data:

Tool | Time to analyze 1 TB of web logs | Time relative to MapReduce | Time to analyze 10 TB of web logs | Time relative to MapReduce | Scaling, 10 TB vs. 1 TB workload
MapReduce | 7 min 40 sec | 1x | 69 min 54 sec | 1x | 9.12x
Hive | 9 min 34 sec | 1.25x | 74 min 59 sec | 1.07x | 7.4x
Pig | 20 min 57 sec | 2.73x | 183 min 54 sec | 2.63x | 8.78x

Selecting the right tool for the job

To explore utilities available in the Hadoop ecosystem, Dell engineers used the MapReduce, Hive and Pig programs to analyze files in the standard Apache HTTP Server log format. The algorithms created by the Dell team can be adapted easily to other log formats.

As might be expected, the MapReduce program performed the best for both sets of log files tested, because it is a single program explicitly written for the MapReduce framework. The Hive and Pig programs, which generate multiple MapReduce programs to accomplish the same task, took longer to execute. However, the performance difference was less pronounced with the larger data set, indicating that the overhead of running multiple batch jobs in Hive and Pig had less impact on longer-running batch jobs. Moreover, all three programs showed excellent scalability: the 10 TB data set took less than 10 times as long to analyze as the 1 TB data set.

These results demonstrated the trade-off between development time and execution time. Hive and Pig programs are usually quicker to develop but take longer to run than MapReduce programs, with less of a disadvantage for larger workloads. In the end, enterprises can effectively leverage all three approaches to harness the potential of big data for making informed business decisions.

Sidebar: Configuration details

In October 2013, Dell engineers at the Dell Solution Center in Round Rock, Texas, tested programs written using MapReduce from Apache Hadoop 1.0.3, Apache Hive 0.9.0 and Apache Pig 0.11.1. The programs ran on a cluster that was based on the Intel® Distribution for Apache Hadoop version 2.4.1.* The cluster's name nodes and edge node ran on Dell PowerEdge R720 servers, each with two eight-core Intel® Xeon® processors E5-2650 at 2.0 GHz, 128 GB of memory and six 600 GB Serial Attached SCSI (SAS) disks in a RAID-10 configuration. The 20 data nodes were PowerEdge R720xd servers with two eight-core Intel Xeon processors E5-2650 at 2.0 GHz, 64 GB of memory and twenty-four 500 GB Serial ATA (SATA) disks, each in a RAID-0 configuration. The total raw disk space on the cluster was over 200 TB. The servers were connected through Dell Networking S60 switches and Gigabit Ethernet (GbE) network interface cards.

A set of 366 Apache web log files, one for each day of 2012, was created by the GeoWeb Apache Weblog Generator tool and stored in Hadoop Distributed File System (HDFS). Each day consisted of 11,900,000 log entries, and the total size occupied by the log files was 1 TB. A second set of log files was created with 119,000,000 entries per day, for a total size of 10 TB.

* The implementation of the Intel Distribution on Dell PowerEdge servers, including generalized cluster configuration parameters, is described in the Dell white paper "Intel Distribution for Apache Hadoop On Dell PowerEdge Servers," available at qrs.ly/xg3tmsg.

Dive deeper

Download this white paper for an in-depth exploration of the geographical and temporal analysis of Apache web logs using MapReduce, Hive and Pig. The appendices include code listings for the programs used in the analysis: qrs.ly/af3tmsi

Notes

[1] Visit github.com/DaveJaffe/BigDataDemos to view more information and complete code for the GeoWeb Apache Weblog Generator tool.
[2] Top 20 countries determined from 2011 Wikipedia data.
[3] Visit github.com/DaveJaffe/BigDataDemos to view complete code listings of the MapReduce program, which comprises the GeoWeb.java driver, the GeoWebMapper.java mapper and the SumReducer.java combiner and reducer.
[4] Visit github.com/DaveJaffe/BigDataDemos to view complete code for the geoweb.q Hive program.
[5] Visit github.com/DaveJaffe/BigDataDemos to view complete code for the geoweb.pig Pig program.

Learn more

Apache Hadoop: hadoop.apache.org
Apache Hive: hive.apache.org
Apache Pig: pig.apache.org

Dell and PowerEdge are trademarks of Dell Inc.

Dave Jaffe is a solution architect for Dell Solution Centers.