Hadoop is a software framework that allows for distributed processing of large data sets across clusters of computers. It uses MapReduce as a programming model and HDFS for storage. MapReduce divides applications into parallelizable map and reduce tasks that process key-value pairs across large datasets in a reliable and fault-tolerant manner. HDFS stores multiple replicas of data blocks for reliability and allows processing of data in parallel on nodes where the data is located. Hadoop can reliably store and process petabytes of data on thousands of low-cost commodity hardware nodes.
1. Hadoop: A Software Framework for Data-Intensive Computing Applications
Ravi Mukkamala
Department of Computer Science
Old Dominion University
Norfolk, VA, USA
mukka@cs.odu.edu
2. What is Hadoop?
• Software platform that lets one easily write and run applications that
process vast amounts of data. It includes:
– MapReduce – offline computing engine
– HDFS – Hadoop distributed file system
– HBase (pre-alpha) – online data access
• Yahoo! is the biggest contributor
• Here's what makes it especially useful:
– Scalable: It can reliably store and process petabytes.
– Economical: It distributes the data and processing across clusters of
commonly available computers (in thousands).
– Efficient: By distributing the data, it can process it in parallel on the
nodes where the data is located.
– Reliable: It automatically maintains multiple copies of data and
automatically redeploys computing tasks based on failures.
3. What does it do?
• Hadoop implements Google’s MapReduce, using HDFS
• MapReduce divides applications into many small blocks of work.
• HDFS creates multiple replicas of data blocks for reliability, placing them
on compute nodes around the cluster.
• MapReduce can then process the data where it is located.
• Hadoop's target is to run on clusters on the order of 10,000 nodes.
4. Hadoop: Assumptions
It is written with large clusters of computers in mind and is built around the
following assumptions:
– Hardware will fail.
– Processing will be run in batches. Thus there is an emphasis on high
throughput as opposed to low latency.
– Applications that run on HDFS have large data sets. A typical file in
HDFS is gigabytes to terabytes in size.
– It should provide high aggregate data bandwidth and scale to
hundreds of nodes in a single cluster. It should support tens of millions
of files in a single instance.
– Applications need a write-once-read-many access model.
– Moving Computation is Cheaper than Moving Data.
– Portability is important.
5. Apache Hadoop Wins Terabyte Sort Benchmark (July 2008)
• One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds,
which beat the previous record of 297 seconds in the annual general
purpose (daytona) terabyte sort benchmark. The sort benchmark specifies
the input data (10 billion 100 byte records), which must be completely
sorted and written to disk.
• The sort used 1800 maps and 1800 reduces and allocated enough buffer
memory to hold the intermediate data in memory.
• The cluster had 910 nodes; 2 quad-core Xeons @ 2.0 GHz per node; 4 SATA
disks per node; 8 GB RAM per node; 1 gigabit Ethernet on each node; 40
nodes per rack; 8 gigabit Ethernet uplinks from each rack to the core;
Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18); Sun Java JDK
1.6.0_05-b13
6. Example Applications and Organizations using Hadoop
• A9.com – Amazon: To build Amazon's product search indices; process millions of
sessions daily for analytics, using both the Java and streaming APIs; clusters vary
from 1 to 100 nodes.
• Yahoo! : More than 100,000 CPUs in ~20,000 computers running Hadoop; biggest
cluster: 2000 nodes (2*4cpu boxes with 4TB disk each); used to support research
for Ad Systems and Web Search
• AOL : Used for a variety of things ranging from statistics generation to running
advanced algorithms for doing behavioral analysis and targeting; cluster size is 50
machines, Intel Xeon, dual-processor, dual-core, each with 16 GB RAM and 800 GB
hard disk, giving a total of 37 TB of HDFS capacity.
• Facebook: To store copies of internal log and dimension data sources and use it as a
source for reporting/analytics and machine learning; 320 machine cluster with
2,560 cores and about 1.3 PB raw storage;
• FOX Interactive Media : 3 X 20 machine cluster (8 cores/machine, 2TB/machine
storage) ; 10 machine cluster (8 cores/machine, 1TB/machine storage); Used for log
analysis, data mining and machine learning
• University of Nebraska Lincoln: one medium-sized Hadoop cluster (200TB) to
store and serve physics data;
7. More Hadoop Applications
• Adknowledge - to build the recommender system for behavioral targeting, plus
other clickstream analytics; clusters vary from 50 to 200 nodes, mostly on EC2.
• Contextweb - to store ad serving log and use it as a source for Ad optimizations/
Analytics/reporting/machine learning; 23 machine cluster with 184 cores and about
35TB raw storage. Each (commodity) node has 8 cores, 8GB RAM and 1.7 TB of
storage.
• Cornell University Web Lab: Generating web graphs on 100 nodes (dual 2.4GHz
Xeon Processor, 2 GB RAM, 72GB Hard Drive)
• NetSeer - Up to 1000 instances on Amazon EC2 ; Data storage in Amazon S3; Used
for crawling, processing, serving and log analysis
• The New York Times : Large scale image conversions ; EC2 to run hadoop on a
large virtual cluster
• Powerset / Microsoft - Natural Language Search; up to 400 instances on Amazon
EC2 ; data storage in Amazon S3
8. MapReduce Paradigm
• Programming model developed at Google
• Sort/merge based distributed computing
• Initially, it was intended for Google's internal search/indexing
application, but it is now used extensively by many other organizations
(e.g., Yahoo, Amazon.com, IBM, etc.)
• It is a functional-style programming model (as in LISP) that is naturally
parallelizable across a large cluster of workstations or PCs.
• The underlying system takes care of the partitioning of the
input data, scheduling the program’s execution across
several machines, handling machine failures, and managing
required inter-machine communication. (This is the key for
Hadoop’s success)
9. How does MapReduce work?
• The run time partitions the input and provides it to different
Map instances;
• Map(key, value) → (key’, value’)
• The run time collects the (key’, value’) pairs and distributes
them to several Reduce functions so that each Reduce
function gets the pairs with the same key’.
• Each Reduce produces a single (or zero) file output.
• Map and Reduce are user-written functions
10. Example MapReduce: To count the occurrences
of words in the given set of documents
map(String key, String value):
    // key: document name; value: document contents;   map: (k1, v1) → list(k2, v2)
    for each word w in value:
        EmitIntermediate(w, "1");
(Example: if the input string is “Saibaba is God. I am I”, Map produces
    {<“Saibaba”, 1>, <“is”, 1>, <“God”, 1>, <“I”, 1>, <“am”, 1>, <“I”, 1>})
reduce(String key, Iterator values):
    // key: a word; values: a list of counts;   reduce: (k2, list(v2)) → list(v2)
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
(Example: reduce(“I”, <1, 1>) → 2)
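The pseudocode above maps onto Hadoop's Java API almost line for line. Below is a minimal, self-contained sketch using the org.apache.hadoop.mapreduce classes; the class names and the command-line input/output paths are illustrative, not part of the original slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: (docName, docContents) -> list(word, 1)
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);               // EmitIntermediate(w, 1)
      }
    }
  }

  // reduce: (word, list(counts)) -> (word, sum)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);               // Emit(AsString(result))
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // optional map-side pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it would typically be launched with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>; each reduce task writes its share of the counts to its own output file.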
11. Example applications
• Distributed grep (as in Unix grep command)
• Count of URL Access Frequency
• Reverse Web-Link Graph: list of all source
URLs associated with a given target URL
• Inverted index: Produces <word,
list(Document ID)> pairs
• Distributed sort
12. MapReduce-Fault tolerance
• Worker failure: The master pings every worker periodically. If no
response is received from a worker in a certain amount of time, the
master marks the worker as failed. Any map tasks completed by the
worker are reset back to their initial idle state, and therefore become
eligible for scheduling on other workers. Similarly, any map task or reduce
task in progress on a failed worker is also reset to idle and becomes
eligible for rescheduling.
• Master Failure: It is easy to make the master write periodic checkpoints of
the master data structures described above. If the master task dies, a new
copy can be started from the last checkpointed state. However, in most
cases, the user restarts the job.
13. Mapping workers to Processors
• The input data (on HDFS) is stored on the local disks of the machines in
the cluster. HDFS divides each file into 64 MB blocks, and stores several
copies of each block (typically 3 copies) on different machines.
• The MapReduce master takes the location information of the input files
into account and attempts to schedule a map task on a machine that
contains a replica of the corresponding input data. Failing that, it attempts
to schedule a map task near a replica of that task's input data. When
running large MapReduce operations on a significant fraction of the
workers in a cluster, most input data is read locally and consumes no
network bandwidth.
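For a concrete view of the location information the scheduler relies on, the HDFS Java API exposes the replica hosts of each block; a minimal sketch follows (the file path is a placeholder, and this is an illustration of the API, not the scheduler's own code).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path("/data/input/part-00000"));
    // One entry per block; each block lists the hosts holding a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println(b.getOffset() + "+" + b.getLength() + " -> "
          + String.join(", ", b.getHosts()));
    }
  }
}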
14. Task Granularity
• The map phase has M pieces and the reduce phase has R pieces.
• M and R should be much larger than the number of worker machines.
• Having each worker perform many different tasks improves dynamic load
balancing, and also speeds up recovery when a worker fails.
• The larger M and R are, the more scheduling decisions the master must make
• R is often constrained by users because the output of each reduce task ends
up in a separate output file.
• Typically, (at Google), M = 200,000 and R = 5,000, using 2,000 worker
machines.
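In the Hadoop Java API, R is exactly what the user sets on the job, while M falls out of how the input is split into blocks; a one-line sketch (the value is illustrative):

job.setNumReduceTasks(4000);   // R reduce tasks, hence 4000 output files (part-r-00000, ...)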
15. Additional support functions
• Partitioning function: The users of MapReduce specify the number of
reduce tasks/output files that they desire (R). Data gets partitioned across
these tasks using a partitioning function on the intermediate key. A default
partitioning function is provided that uses hashing (e.g., hash(key) mod R).
In some cases, it may be useful to partition data by some other function of
the key. The user of the MapReduce library can provide a special
partitioning function.
• Combiner function: User can specify a Combiner function that does
partial merging of the intermediate local disk data before it is sent over the
network. The Combiner function is executed on each machine that
performs a map task. Typically the same code is used to implement both the
combiner and the reduce functions.
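As an illustration of both hooks, here is a hedged sketch in the Java API: FirstCharPartitioner is an invented example class, while Partitioner, Job.setPartitionerClass and Job.setCombinerClass are the standard extension points, and the combiner simply reuses the word-count reducer from the earlier sketch.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// The default is hash(key) mod R (HashPartitioner); this example instead
// partitions words by their first character, so related keys end up in the
// same reduce task and therefore the same output file.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String w = key.toString();
    if (w.isEmpty() || numPartitions == 0) {
      return 0;
    }
    return w.charAt(0) % numPartitions;   // char is non-negative, so this is a valid partition
  }
}

// Wiring it up on the job:
job.setPartitionerClass(FirstCharPartitioner.class);
job.setCombinerClass(IntSumReducer.class);   // partial, map-side merge of counts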
16. Problem: seeks are expensive
● CPU & transfer speed, RAM & disk size – double every 18-24 months
● Seek time nearly constant (~5%/year)
● Time to read entire drive is growing – scalable computing must go at transfer rate
• Example: Updating a terabyte DB, given: 10 MB/s transfer, 10 ms/seek, 100 B/entry
(10 billion entries), 10 KB/page (1 billion pages)
● Updating 1% of entries (100Million) takes:
– 1000 days with random B-Tree updates
– 100 days with batched B-Tree updates
– 1 day with sort & merge
• To process 100TB datasets
• on 1 node: – scanning @ 50MB/s = 23 days
• on 1000 node cluster: – scanning @ 50MB/s = 33 min
– MTBF = 1 day
• Need framework for distribution – efficient, reliable, easy to use
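A quick back-of-the-envelope check of the scan figures above (assuming the stated 50 MB/s sequential scan rate per node and ideal parallelism):

    100 TB ÷ 50 MB/s = 2 × 10^6 s ≈ 23 days on a single node
    2 × 10^6 s ÷ 1000 nodes = 2 × 10^3 s ≈ 33 minutes across the cluster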
17. HDFS
• The Hadoop Distributed File System (HDFS) is a distributed file system
designed to run on commodity hardware. It has many similarities with
existing distributed file systems. However, the differences from other
distributed file systems are significant.
– highly fault-tolerant and is designed to be deployed on low-cost
hardware.
– provides high throughput access to application data and is suitable for
applications that have large data sets.
– relaxes a few POSIX requirements to enable streaming access to file
system data.
– part of the Apache Hadoop Core project. The project URL is
http://hadoop.apache.org/core/.
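A minimal client-side sketch of HDFS through the FileSystem Java API (the NameNode URI and the file path are placeholders for a real cluster's values):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000");   // placeholder NameNode address
    FileSystem fs = FileSystem.get(conf);

    Path p = new Path("/user/demo/hello.txt");

    // Write once (HDFS favours a write-once-read-many access model).
    try (FSDataOutputStream out = fs.create(p, true /* overwrite */)) {
      out.write("hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many times, from any node in the cluster.
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(fs.open(p), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
  }
}

The same operations are available from the command line through the file system shell, e.g. hadoop fs -put and hadoop fs -cat.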
19. Example runs [1]
• Cluster configuration: ≈1800 machines; each with two 2GHz Intel Xeon
processors with 4GB of memory (1-1.5 GB reserved for other tasks), two
160GB IDE disks, and a gigabit Ethernet link. All of the machines were in
the same hosting facility and therefore the round-trip time between any pair
of machines was less than a millisecond.
• Grep: Scans through 10^10 100-byte records (distributed over 1000 input
files by GFS), searching for a relatively rare three-character pattern (the
pattern occurs in 92,337 records). The input is split into approximately
64MB pieces (M = 15000), and the entire output is placed in one file (R =
1). The entire computation took approximately 150 seconds from start to
finish including 60 seconds to start the job.
• Sort: Sorts 10^10 100-byte records (approximately 1 terabyte of data). As
before, the input data is split into 64MB pieces (M = 15000) and R = 4000.
Including startup overhead, the entire computation took 891 seconds.
20. Execution overview
1. The MapReduce library in the user program first splits input files into M
pieces of typically 16 MB to 64 MB/piece. It then starts up many copies of
the program on a cluster of machines.
2. One of the copies of the program is the master. The rest are workers that are
assigned work by the master. There are M map tasks and R reduce tasks to
assign. The master picks idle workers and assigns each one a map task or a
reduce task.
3. A worker who is assigned a map task reads the contents of the assigned
input split. It parses key/value pairs out of the input data and passes each
pair to the user-defined Map function. The intermediate key/value pairs
produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to the local disk, partitioned into R
regions. The locations of these pairs on the local disk are passed back to the
master, who forwards them to the reduce workers.
21. Execution overview (cont.)
5. When a reduce worker is notified by the master about these locations, it uses
remote procedure calls (RPCs) to read the buffered data from the local disks
of the map workers. When a reduce worker has read all intermediate data, it
sorts it by the intermediate keys so that all occurrences of the same key are
grouped together.
6. The reduce worker iterates over the sorted intermediate data and for each
unique intermediate key encountered, it passes the key and the
corresponding set of intermediate values to the user's Reduce function. The
output of the Reduce function is appended to a final output file for this
reduce partition.
7. When all map tasks and reduce tasks have been completed, the master
wakes up the user program---the MapReduce call in the user program
returns back to the user code. The output of the mapreduce execution is
available in the R output files (one per reduce task).
22. References
[1] J. Dean and S. Ghemawat, ``MapReduce: Simplified Data Processing on
Large Clusters,'' OSDI 2004. (Google)
[2] D. Cutting and E. Baldeschwieler, ``Meet Hadoop,’’ OSCON, Portland,
OR, USA, 25 July 2007 (Yahoo!)
[3] R. E. Bryant, ``Data-Intensive Supercomputing: The Case for DISC,''
Tech Report: CMU-CS-07-128, http://www.cs.cmu.edu/~bryant