The document provides an introduction to big data and Hadoop. It defines big data as large datasets that cannot be processed using traditional computing techniques due to the volume, variety, velocity, and other characteristics of the data. It discusses traditional data processing versus big data and introduces Hadoop as an open-source framework for storing, processing, and analyzing large datasets in a distributed environment. The document outlines the key components of Hadoop including HDFS, MapReduce, YARN, and Hadoop distributions from vendors like Cloudera and Hortonworks.
View the Big Data Technology Stack in a nutshell. This Big Data Technology Stack deck covers the different layers of the Big Data world and summarizes the major technologies in vogue today.
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
This presentation about Hadoop for beginners will help you understand what is Hadoop, why Hadoop, what is Hadoop HDFS, Hadoop MapReduce, Hadoop YARN, a use case of Hadoop and finally a demo on HDFS (Hadoop Distributed File System), MapReduce and YARN. Big Data is a massive amount of data which cannot be stored, processed, and analyzed using traditional systems. To overcome this problem, we use Hadoop. Hadoop is a framework which stores and handles Big Data in a distributed and parallel fashion. Hadoop overcomes the challenges of Big Data. Hadoop has three components HDFS, MapReduce, and YARN. HDFS is the storage unit of Hadoop, MapReduce is its processing unit, and YARN is the resource management unit of Hadoop. In this video, we will look into these units individually and also see a demo on each of these units.
Below topics are explained in this Hadoop presentation:
1. What is Hadoop
2. Why Hadoop
3. Big Data generation
4. Hadoop HDFS
5. Hadoop MapReduce
6. Hadoop YARN
7. Use of Hadoop
8. Demo on HDFS, MapReduce and YARN
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Arvo with Hive, and Sqoop and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distribution datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
View the Big Data Technology Stack in a nutshell. This Big Data Technology Stack deck covers the different layers of the Big Data world and summarizes the major technologies in vogue today.
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
This presentation about Hadoop for beginners will help you understand what is Hadoop, why Hadoop, what is Hadoop HDFS, Hadoop MapReduce, Hadoop YARN, a use case of Hadoop and finally a demo on HDFS (Hadoop Distributed File System), MapReduce and YARN. Big Data is a massive amount of data which cannot be stored, processed, and analyzed using traditional systems. To overcome this problem, we use Hadoop. Hadoop is a framework which stores and handles Big Data in a distributed and parallel fashion. Hadoop overcomes the challenges of Big Data. Hadoop has three components HDFS, MapReduce, and YARN. HDFS is the storage unit of Hadoop, MapReduce is its processing unit, and YARN is the resource management unit of Hadoop. In this video, we will look into these units individually and also see a demo on each of these units.
Below topics are explained in this Hadoop presentation:
1. What is Hadoop
2. Why Hadoop
3. Big Data generation
4. Hadoop HDFS
5. Hadoop MapReduce
6. Hadoop YARN
7. Use of Hadoop
8. Demo on HDFS, MapReduce and YARN
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Arvo with Hive, and Sqoop and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distribution datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
This presentation Simplify the concepts of Big data and NoSQL databases & Hadoop components.
The Original Source:
http://zohararad.github.io/presentations/big-data-introduction/
Big Data raises challenges about how to process such vast pool of raw data and how to aggregate value to our lives. For addressing these demands an ecosystem of tools named Hadoop was conceived.
Presentation regarding big data. The presentation also contains basics regarding Hadoop and Hadoop components along with their architecture. Contents of the PPT are
1. Understanding Big Data
2. Understanding Hadoop & It’s Components
3. Components of Hadoop Ecosystem
4. Data Storage Component of Hadoop
5. Data Processing Component of Hadoop
6. Data Access Component of Hadoop
7. Data Management Component of Hadoop
8.Hadoop Security Management Tool: Knox ,Ranger
Hadoop has showed itself as a great tool in resolving problems with different data aspects as Data Velocity, Variety and Volume, that are causing troubles to relational database storage. In this presentation you'll learn what problems with data are occurring nowdays and how Hadoop can solve them . You'll learn about Hadop basic components and principles that make Hadoop such great tool.
The presentation covers following topics: 1) Hadoop Introduction 2) Hadoop nodes and daemons 3) Architecture 4) Hadoop best features 5) Hadoop characteristics. For more further knowledge of Hadoop refer the link: http://data-flair.training/blogs/hadoop-tutorial-for-beginners/
Vikram Andem Big Data Strategy @ IATA Technology Roadmap IT Strategy Group
Vikram Andem, Senior Manager, United Airlines, A case for Bigdata Program and Strategy @ IATA Technology Roadmap 2014, October 13th, 2014, Montréal, Canada
This presentation Simplify the concepts of Big data and NoSQL databases & Hadoop components.
The Original Source:
http://zohararad.github.io/presentations/big-data-introduction/
Big Data raises challenges about how to process such vast pool of raw data and how to aggregate value to our lives. For addressing these demands an ecosystem of tools named Hadoop was conceived.
Presentation regarding big data. The presentation also contains basics regarding Hadoop and Hadoop components along with their architecture. Contents of the PPT are
1. Understanding Big Data
2. Understanding Hadoop & It’s Components
3. Components of Hadoop Ecosystem
4. Data Storage Component of Hadoop
5. Data Processing Component of Hadoop
6. Data Access Component of Hadoop
7. Data Management Component of Hadoop
8.Hadoop Security Management Tool: Knox ,Ranger
Hadoop has showed itself as a great tool in resolving problems with different data aspects as Data Velocity, Variety and Volume, that are causing troubles to relational database storage. In this presentation you'll learn what problems with data are occurring nowdays and how Hadoop can solve them . You'll learn about Hadop basic components and principles that make Hadoop such great tool.
The presentation covers following topics: 1) Hadoop Introduction 2) Hadoop nodes and daemons 3) Architecture 4) Hadoop best features 5) Hadoop characteristics. For more further knowledge of Hadoop refer the link: http://data-flair.training/blogs/hadoop-tutorial-for-beginners/
Vikram Andem Big Data Strategy @ IATA Technology Roadmap IT Strategy Group
Vikram Andem, Senior Manager, United Airlines, A case for Bigdata Program and Strategy @ IATA Technology Roadmap 2014, October 13th, 2014, Montréal, Canada
This Presentation gives an insight into what is big data, data analytics, difference between big data and data science.And also salary trends in big data analytics.
Big data is data that, by virtue of its velocity, volume, or variety (the three Vs), cannot be easily stored or analyzed with traditional methods. Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware.
Big data analytics: Technology's bleeding edgeBhavya Gulati
There can be data without information , but there can not be information without data.
Companies without Big Data Analytics are deaf and dumb , mere wanderers on web.
this presentation describes the company from where I did my summer training and what is bigdata why we use big data, big data challenges, the issue in big data, the solution of big data issues, hadoop, docker , Ansible etc.
I have collected information for the beginners to provide an overview of big data and hadoop which will help them to understand the basics and give them a Start-Up.
A brief intro on the idea of what is Big Data and it's potential. This is primarily a basic study & I have quoted the source of infographics, stats & text at the end. If I have missed any reference due to human error & you recognize another source, please mention.
An Introduction to Big data and problems associated with storing and analyzing big data and How Hadoop solves the problem with its HDFS and MapReduce frameworks. A little intro to HDInsight, Hadoop on windows azure.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
3. Outline
Introduction to Big Data
Traditional Data Processing
Introduction to Hadoop
HadoopArchitecture
Hadoop and RDBMS
Hadoop Distributions
4. What is Big Data?
Collection of large datasets that cannot be processed using traditional
computing techniques.
Not a single technique or a tool, rather a complete subject. Various tools,
techniques and frameworks.
It’s not the amount of data that’s important. It’s what organisations do with
the data that matters.
Big data can be analysed for insights that lead to better decisions and
strategic business moves.
5. Traditional Data Processing
For storage purpose, the programmers will take the help of their choice of database
vendors such as Oracle, IBM and others.
An enterprise will have a computer to store and process big data.
The user interacts with the application, which in turn handles the part of data
storage and analysis.
6. Characteristics - Vs of Big Data
• Quality of the
data
• Structured, Un-
structured,
Semi-structured
• Periodic, Near-
time, Real-time
• Terabytes of
data,
Transactions,
Files
Volume Velocity
VeracityVariety
7. Characteristics - Vs of Big Data
Volume - Size of data
• Researchers have predicted that 40 Zettabytes (40,000 Exabytes) will be
generated by 2020, which is an increase of 300 times from 2005.
Velocity
• There are 1.03 billion Daily Active Users (Facebook) on Mobile as of now, which is
an increase of 22% year-over-year.
8. Characteristics - Vs of Big Data
Variety
• Data can be structured, semi-structured or unstructured in the form of images,
audios, videos, sensor data.
• Variety of data creates problems in capturing, storage, mining and analyzing the
data.
Veracity – data uncertainty, data inconsistency and incompleteness
• Due to uncertainty of data, 1 in 3 business leaders don’t trust the information they
use to make decisions.
9. Characteristics - Vs of Big Data
Veracity
• Poor data quality costs the US economy around $3.1 trillion a year.
Value
• Adding to the benefits of the organizations. Is the organization working on Big
Data achieving high ROI?
11. What Comes Under Big Data?
• Black Box Data: Voices of the flight crew, performance information of the aircraft.
• Social Media Data: Posts and views by millions of people across the globe.
• Stock Exchange Data: information about the ‘buy’ and ‘sell’ decisions made on the
stock exchange.
• Power Grid Data: Electricity consumed by a particular node with respect to a base
station.
• Transport Data: Shipping / freight data.
12. Examples of Big Data
• Walmart handles more than 1 million customer transactions every hour.
• Facebook stores, accesses, and analyzes 30+ Petabytes of user generated data.
• 230+ millions of tweets are created every day.
• YouTube users upload 48 hours of new video every minute of the day.
• Amazon handles 15 million customer click stream user data per day to recommend
products.
• 294 billion emails are sent every day. Services analyses this data to find the spams.
• Modern cars have close to 100 sensors.
13. Applications of Big Data
• Smarter Healthcare (EHR): Predict the patient’s deteriorating condition in advance.
• Telecom: Reduce data packet loss. Offer personalized plans.
• Retail: Recommendation engines - Suggestion based on the browsing history of the
consumer.
• Traffic control: managing traffic better via effective use of data and sensors.
• Manufacturing: reduce component defects, improve product quality, increase
efficiency, and save time and money.
• Search Quality: Personalised search results based on previous searches.
14. AlphaGo is a narrow AI developed by Alphabet Inc. to play the board game named Go.
AlphaGo's algorithm uses a tree search algorithm to find its moves based on previously
learned knowledge.
It gains knowledge by machine learning, specifically by an artificial neural network from
extensive training, both from human and computer play.
In March 2016, it beat a professional GO player named Lee Sedol in a five-game match for
the first time.
15. Google Photos
Google implements different forms of machine learning into the Photos service,
particularly in recognition of photo contents.
People: Photos app collects all the photos containing faces. It doesn’t identify these
people, but just collects them for quick access
Places: The relies on landmarks. It can correctly identify well-known places like Taj
Mahal.
Things: This feature aggregates pictures of things like, flowers, cars, sky, birthdays and
cats.There are many more categories, including screenshots, posters and castles.
Not everything in the garden is rosy!
16. Challenges with Big Data
• Data Quality: messy, inconsistent and incomplete data. Dirty data cost $600 billion to
the companies each year in the United States.
• Discovery: Analyzing petabytes of data using powerful algorithms to find patterns and
insights are very difficult.
• Storage: The more data an organization has, the more complex the problem of
managing it.
• Lack of Talent: A developers, data scientists and analysts who also have sufficient
amount of domain knowledge.
17. Summary of Big Data
Big Data is defined as data that is huge in size. Bigdata is a term used to describe a
collection of data that is huge in size and yet growing exponentially with time.
Examples of Big Data generation includes stock exchanges, social media sites, jet
engines, etc.
Big Data could be 1) Structured, 2) Unstructured, 3) Semi-structured
Volume,Variety,Velocity, andVariability are few Characteristics of Bigdata
18. Introduction to Hadoop
Hadoop is an open source framework from Apache.
Used to store process and analyze data which are very huge in
volume.
Hadoop is written in Java and is not OLAP (online analytical
processing).
Used for batch/offline processing.
20. Modules of Hadoop
Hadoop Distributed File System: States that the files will be broken into blocks
and stored in nodes over the distributed architecture.
Yet another Resource Negotiator: Used for job scheduling and manage the
cluster.
Map Reduce: Framework which helps programs to do the parallel computation
on data using key value pair.
21. Modules of Hadoop
Map Reduce: The Map task takes input data and converts it into a data set which
can be computed in Key value pair.The output of Map task is consumed by
reduce task and then the out of reducer gives the desired result.
Hadoop Common:These Java libraries are used to start Hadoop and are used by
other Hadoop modules.
26. Hadoop Architecture
NameNode
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Transaction log for file deletes/adds,etc. Does not use transaction for whole
blocks or file-streams, only metadata.
• Handles creation of more replica blocks when necessary after a DataNode
failure.
27. Hadoop Architecture
DataNode
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere
28. Hadoop Architecture
Job Tracker
The role of Job Tracker is to accept the MapReduce jobs from client and process the
data by using NameNode. In response, NameNode provides metadata to Job
Tracker.
Task Tracker
It works as a slave node for Job Tracker. It receives task and code from Job Tracker
and applies that code on the file. This process can also be called as a Mapper.
29. Hadoop vs RDBMS
Until recently many applications utilized RDBMS for batch processing – Oracle,
Sybase, MySQL, Microsoft SQL Server, etc.
Hadoop doesn’t fully replace relational products; many architectures would
benefit from both Hadoop and a Relational product(s).
30. Hadoop vs RDBMS
RDBMS products scale up
• Expensive to scale for larger installations
• Hits a ceiling when storage reaches 100s of terabytes
Hadoop clusters can scale-out to 100s of machines and to petabytes of storage.
31. Comparison to RDBMS
Hadoop was not designed for real-time or low latency queries
Products that do provide low latency queries such as HBase have limited query
functionality
Hadoop performs best for offline batch processing on large amounts of data
RDBMS is best for online transactions and low-latency queries
Hadoop is designed to stream large files and large amounts of data
RDBMS works best with small records
32. Hadoop Distributions
Let’s say you go download Hadoop’s HDFS and MapReduce.
At first it works great but then you decide to start using Hbase.
No problem, just download HBase and point it to your existing HDFS.
But you find that HBase can only work with a previous version of HDFS.
You go downgrade HDFS and everything still works great.
33. Hadoop Distributions
Hadoop Distributions aim to resolve version incompatibilities.
DistributionVendor will do the following:
1. IntegrationTest a set of Hadoop products.
2. Distributions may provide additional scripts to execute Hadoop
3. Package Hadoop products in various installation formats
1. Linux Packages, tarballs
35. Distribution Vendors
Cloudera Hadoop Distribution
MapR Hadoop Distribution
AmazonWeb Services Elastic MapReduce Hadoop Distribution
Hortonworks Hadoop Distribution
IBM Infosphere BigInsights Hadoop Distribution
Microsoft Azure's HDInsight Cloud based Hadoop Distribution
36. Cloudera Distribution for Hadoop
Most popular distribution
Cloudera has taken the lead on providing Hadoop Distribution
CDH is provided in various formats
Linux Packages,Virtual Machine Images, andTarballs
AmazonWeb Services Elastic MapReduce Hadoop Distribution.
Integrates HDFS, MapReduce, HBase, Hive, Mahout, Oozie, Pig,
Sqoop,Whirr, Zookeeper, Flume
37. References
• McAfee, A. (2012). Big Data:The Management Revolution. Harvard Business Review.
• Zettaset. (2010). What is Big Data andWhy Do Organizations Need It? Retrieved from Zettaset
Corporate: http://www.zettaset.com/index.php/info-center/what-is-big-data