1. The document discusses large-scale data storage and processing options for scientists in the Netherlands, focusing on Hadoop and its components HDFS and MapReduce.
2. HDFS provides a distributed file system that stores very large datasets across clusters of machines, while MapReduce allows processing of datasets in parallel across a cluster.
3. A case study is described that uses HDFS for storage of a 2.7TB text file and MapReduce for analyzing the data to study category evolution in Wikipedia articles over time.
Introduction to SARA's Hadoop Hackathon - Dec 7th 2010, Evert Lammerts
This was the first of two introductory presentations at the first Hadoop Hackathon at SARA, the Dutch center for High Performance Computing and Networking.
4. Case Study: Virtual Knowledge Studio
http://simshelf2.virtualknowledgestudio.nl/activities/biggrid-wikipedia-experiment
● How do categories in Wikipedia evolve over time? (And how do they relate to internal links?)
● 2.7 TB of raw text, in a single file
● A Java application searches for categories in Wiki markup, like [[Category:NAME]]
● Executed on the Grid
5. 1.1) Copy the file from the local machine to Grid storage
2.1) Stream the file from Grid storage to a single machine
2.2) Cut it into pieces of 10 GB
2.3) Stream the pieces back to Grid storage
3.1) Process all files in parallel: N machines run the Java application, each fetching a 10 GB file as input, processing it, and putting the result back
6. Status Quo:
Arrange your own (Data-)parallelism
● Cut the dataset up into “processable chunks”:
● The size of a chunk depends on the local space on the processing node...
● ... on the total processing capacity available...
● ... on the smallest unit of work (“largest grade of parallelism”)...
● ... on the substance (sometimes you don't want many output files, e.g. when building a search index).
● Submit the number of jobs you consider necessary:
● To a cluster close to your data (270x10GB over the WAN is a bad idea)
● The number might depend on the cluster capacity, the number of chunks, the smallest unit of work, the substance...
When dealing with large data, let's say 100 GB+, this is
ERROR PRONE, TIME CONSUMING AND NOT FOR NEWBIES!
10. Hadoop Distributed File System (HDFS)
● Very large DFS. Order of magnitude:
– 10k nodes
– millions of files
– petabytes of storage
● Assumes “commodity hardware”:
– redundancy through replication
– failure handling and recovery
● Optimized for batch processing:
– locations of data exposed to computation
– high aggregate bandwidth
Source: http://www.slideshare.net/jhammerb/hdfs-architecture
11. HDFS Continued...
● Single Namespace for the entire cluster
● Data coherency
– Write-once-read-many model
– Only appending is supported for existing files
● Files are broken up into chunks (“blocks”)
– Block sizes ranging from 64 to 256 MB, depending on configuration
– Blocks are distributed over nodes (a single FILE, consisting of N blocks, is stored on M nodes)
– Blocks are replicated, and the replicas are distributed
● Client accesses the blocks of a file at the nodes directly
– This creates high aggregate bandwidth!
Source: http://www.slideshare.net/jhammerb/hdfs-architecture
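To make "clients access blocks at the nodes directly" concrete, here is a minimal sketch, assuming the 0.20-era FileSystem Java API, that asks the NameNode which DataNodes hold each block of a file (the path is hypothetical):

// A minimal sketch: list block locations for a file, i.e. which DataNodes
// hold a replica of each block. The path /user/you/wiki.txt is hypothetical.
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/you/wiki.txt"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // offset of the block within the file, and the DataNodes holding replicas
            System.out.println(block.getOffset() + " -> " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}

MapReduce uses exactly this information to schedule computation on the nodes that already hold the data.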
12. HDFS NameNode & DataNodes
NameNode:
● Manages the File System Namespace
– Mapping filename to blocks
– Mapping blocks to DataNodes
● Cluster Configuration
● Replication Management
DataNode:
● A “Block Server”
– Stores data in the local FS
– Stores metadata of a block (e.g. hash)
– Serves (meta)data to clients
● Facilitates the pipeline to other DNs
13. http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
Metadata operations
● Communicate with the NN only
– ls, lsr, df, du, chmod, chown... etc.
R/W (block) operations
● Communicate with the NN and DNs
– put, copyFromLocal, copyToLocal, tail... etc.
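The same shell operations can also be driven from Java through the FsShell class (the class behind the "hadoop fs" command); a minimal sketch, assuming the 0.20-era Tool interface, with hypothetical paths:

// A minimal sketch: run FS shell commands programmatically via FsShell.
// Paths are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class ShellDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // metadata operation: talks to the NameNode only
        ToolRunner.run(new FsShell(conf), new String[] { "-ls", "/" });
        // R/W (block) operation: talks to the NameNode, then streams to DataNodes
        ToolRunner.run(new FsShell(conf), new String[] { "-put", "local.txt", "/user/you/" });
    }
}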
14. HDFS Application Programming Interface (API)
● Enables programmers to access any HDFS from their code
● Described at http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
● Written in (and thus available for) Java, but...
● Also exposed through Apache Thrift, so it can be accessed from C++, Python, PHP, Ruby, and others
● See http://wiki.apache.org/hadoop/HDFS-APIs
● Has a separate C API (libhdfs)
So: you can enable your services to work with HDFS
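As an illustration, a minimal sketch of reading a file from HDFS through the Java API (0.20-era classes; the path /user/you/wiki.txt is hypothetical):

// A minimal sketch: open and stream a file from HDFS via the Java API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // connects to the configured NameNode
        FSDataInputStream in = fs.open(new Path("/user/you/wiki.txt"));
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            System.out.write(buf, 0, n);            // stream the file contents to stdout
        }
        in.close();
        fs.close();
    }
}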
15. MapReduce
● A framework for distributed (parallel) processing of large datasets
● Provides a programming model
● Lets users plug in their own code
● Uses a common pattern:
cat | grep | sort | uniq > file
input | map | shuffle | reduce > output
● Useful for large-scale data analytics and processing
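To make the pattern concrete, a minimal sketch modeled on the case study: the map phase "greps" for [[Category:NAME]] tags, the reduce phase counts occurrences per category. The class names and the regex are illustrative, not from the original deck; the 0.20-era org.apache.hadoop.mapreduce API is assumed.

// A minimal map/reduce sketch: emit (category, 1) per tag, sum per category.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CategoryCount {
    public static class CategoryMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Pattern CAT = Pattern.compile("\\[\\[Category:([^\\]]+)\\]\\]");
        private final static IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            Matcher m = CAT.matcher(line.toString());
            while (m.find()) {
                ctx.write(new Text(m.group(1)), ONE);   // emit (category, 1)
            }
        }
    }
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text category, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(category, new IntWritable(sum));  // emit (category, total)
        }
    }
}

The shuffle stage between the two groups all (category, 1) pairs by key, so each reduce call sees every count for one category.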
16. MapReduce Continued...
● Is great for processing large datasets!
– Sends the computation to the data, so little data goes over the lines
– Uses the blocks stored in the DFS, so no splitting is required (this is a bit more sophisticated depending on your input)
● Handles parallelism for you
– One map per block, if possible
● Scales basically linearly
– time_on_cluster = time_on_single_core / total_cores
● Java, but streaming possible (plus others, see later)
17. MapReduce JobTracker & TaskTrackers
JobTracker:
● Holds job metadata
– Status of the job
– Status of the tasks running on TTs
● Decides on scheduling
● Delegates creation of 'InputSplits'
TaskTracker:
● Requests work from the JT
– Fetches the code to execute from the DFS
– Applies job-specific configuration
● Communicates with the JT on tasks:
– Sending output, killing tasks, task updates, etc.
19. MapReduce Application Programming Interface (API)
● Enables programmers to write MapReduce jobs
– More info on MR jobs: http://www.slideshare.net/evertlammerts/infodoc-6107350
● Enables programmers to communicate with a JobTracker
– Submitting jobs, getting statuses, cancelling jobs, etc.
● Described at http://hadoop.apache.org/common/docs/r0.20.0/api/index.html
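As a sketch of job submission through this API (0.20-era classes; the driver class name and both paths are hypothetical, and the mapper/reducer come from the earlier sketch):

// A minimal driver sketch: configure a job and submit it to the JobTracker.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CategoryCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "category-count");
        job.setJarByClass(CategoryCountDriver.class);           // ship this jar to the cluster
        job.setMapperClass(CategoryCount.CategoryMapper.class);
        job.setReducerClass(CategoryCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/you/wiki.txt"));      // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/user/you/categories"));  // hypothetical
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit and wait
    }
}

Submitted this way, the framework creates one InputSplit (and thus one map task) per block where possible, and the JobTracker schedules each task close to a node holding that block.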
20. Case Study: Virtual Knowledge Studio
1) Load the file into HDFS
2) Submit the code to MR
21. What's more on Hadoop?
Lots!
● Apache Pig http://pig.apache.org
– Analyze datasets in a high-level language, “Pig Latin”
– Simple! SQL-like. Extremely fast experiments.
– N-stage jobs (MR chaining!)
● Apache Hive http://hive.apache.org
– Data Warehousing
– Hive QL
● Apache HBase http://hbase.apache.org
– BigTable implementation (Google)
– In-memory operation
– Performance good enough for websites (Facebook built its Messaging Platform on top of it)
● Yahoo! Oozie http://yahoo.github.com/oozie/
– Hadoop workflow engine
● Apache [Avro | Chukwa | Hama | Mahout] and so on
● 3rd Party:
– ElephantBird
– Cloudera's Distribution for Hadoop
– Hue
– Yahoo's Distribution for Hadoop
22. Hadoop @ SARA
A prototype cluster
● Since December 2010
● 20 cores for MR (TTs)
● 110 TB gross for HDFS (DNs) (55 TB net)
● Hue web interface for job submission & management
● SFTP interface to HDFS
● Pig 0.8
● Hive
● Available for scientists / scientific programmers until May / June 2011
Towards a production infrastructure?
● Depending on results
It's open for you all as well: ask me for an account!
24. Hadoop for:
● Large-scale data storage and processing
● The fundamental difference: data locality!
● Small files? Don't, but... Hadoop Archives (HAR)
● Archival? Don't. Use tape storage. (We have lots!)
● Very fast analytics (Pig!)
● Data-parallel applications (not good at cross products – use Huygens or Lisa!)
● Legacy applications are possible through piping / streaming (Weird dependencies? Use the Cloud!)
We'll do another Hackathon on Hadoop. Interested? Send me a mail!
BioAssist Programmers' Day, January 21, 2011