The document provides an introduction to Apache Hadoop, including:
1) It describes Hadoop's architecture which uses HDFS for distributed storage and MapReduce for distributed processing of large datasets across commodity clusters.
2) It explains that Hadoop solves issues of hardware failure and combining data through replication of data blocks and a simple MapReduce programming model.
3) It gives a brief history of Hadoop originating from Doug Cutting's Nutch project and the influence of Google's papers on distributed file systems and MapReduce.
Accompanying slides for the class “Introduction to Hadoop” at the PRACE Autumn school 2020 - HPC and FAIR Big Data organized by the faculty of Mechanical Engineering of the University of Ljubljana (Slovenia).
PRACE Autumn school 2021 - Big Data with Hadoop and Keras
27-30 September 2021
Fakulteta za strojništvo
Europe/Ljubljana
Data and scripts are available at: https://www.events.prace-ri.eu/event/1226/timetable/
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
Accompanying slides for the class “Introduction to Hadoop” at the PRACE Autumn school 2020 - HPC and FAIR Big Data organized by the faculty of Mechanical Engineering of the University of Ljubljana (Slovenia).
PRACE Autumn school 2021 - Big Data with Hadoop and Keras
27-30 September 2021
Fakulteta za strojništvo
Europe/Ljubljana
Data and scripts are available at: https://www.events.prace-ri.eu/event/1226/timetable/
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
Accompanying slides for the class “Introduction to Hadoop” at the PRACE Autumn school 2020 - HPC and FAIR Big Data organized by the faculty of Mechanical Engineering of the University of Ljubljana (Slovenia).
Modern applications, often called “big-data” analysis, require us to manage immense amounts of data quickly. To deal with applications such as these, a new software stack has evolved.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
These slides cover the very basics of Hadoop architecture, in particular HDFS. This was my presentation in the first Delhi Hadoop User Group (DHUG) meetup held at Gurgaon on 10th September 2011. Loved the positive feedback. I'll also upload a more elaborate version covering Hadoop mapreduce architecture as well soon. Most of the stuff covered in these slides can be found in Tom White's book as well (See the last slide)
Yahoo! Hadoop grid makes use of a managed service to get the data pulled into the clusters. However, when it comes to getting the data-out of the clusters, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts out file locations and underlying storage format of data for the users, along with several other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we will focus on how the ODBC/JDBC interface of HiveServer2 accomplished the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about the files, partitions and their location. We will also demo the data out capabilities, and go through other nice properties of the data out feature.
Presenter(s):
Sumeet Singh, Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations.
Whether an on-premise or cloud-based platform is used for storing, processing and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
r packagesdata analytics study material;
learn data analytics online;
data analytics courses;
courses for data analysis;
courses for data analytics;
online data analysis courses;
courses on data analysis;
data analytics classes;
data analysis training courses online;
courses in data analysis;
data analysis courses online;
data analytics training;
courses for data analyst;
data analysis online course;
data analysis certification;
data analysis courses;
data analysis classes;
online course data analysis;
learn data analysis online;
data analysis training;
python for data analysis course;
learn data analytics;
study data analytics;
how to learn data analytics;
data analysis course free;
statistical methods and data analysis;
big data analytics;
data analysis companies;
python data analysis course;
tools that can be used to analyse data;
data analysis consulting;
basic data analytics;
data analysis programs;
examples of data analysis tools;
big data analysis tools;
data analytics tools and techniques;
statistics for data analytics;
data analytics tools;
data analytics and big data;
data analytics big data;
data analysis software;
data analytics with excel;
website data analysis;
data analytics companies;
data analysis qualifications;
tools for data analytics;
data analysis tools;
qualitative data analysis software;
free data analytics;
data analysis website;
tools for analyzing data;
data analytics software;
free data analysis software;
tools for analysing data;
data mining book;
learn data analysis;
about data analytics;
statistical data analysis software;
it data analytics;
data analytics tutorial for beginners;
unstructured data analytics;
data analytics using excel;
dissertation data analysis;
sample of data analysis;
data analysis online;
data analytics;
tools of data analysis;
analytical tools for data analysis;
statistical tools to analyse data;
data analysis help;
data analysis education;
statistical technique for data analysis;
tools for data analysis;
how to learn data analysis;
data analytics tutorial;
excel data analytics;
data mining course;
data analysis software free;
big data and data analytics;
statistical analysis software;
tools to analyse data;
online data analysis;
data mining software;
data analytics statistics;
how to do data analytics;
statistical data analysis tools;
data analyst tools;
business data analysis;
tools and techniques of data analysis;
education data analysis;
advanced data analytics;
study data analysis;
spreadsheet data analysis;
learn data analysis in excel;
software for data analysis;
shared data warehouse;
what are data analysis tools;
data analytics and statistics;
data analyse;
analysis courses;
data analysis tools for research;
research data analysis tools;
big data analysis;
data mining programs;
applications of data analytics;
data analysis tools and techniques;
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
Accompanying slides for the class “Introduction to Hadoop” at the PRACE Autumn school 2020 - HPC and FAIR Big Data organized by the faculty of Mechanical Engineering of the University of Ljubljana (Slovenia).
Modern applications, often called “big-data” analysis, require us to manage immense amounts of data quickly. To deal with applications such as these, a new software stack has evolved.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
These slides cover the very basics of Hadoop architecture, in particular HDFS. This was my presentation in the first Delhi Hadoop User Group (DHUG) meetup held at Gurgaon on 10th September 2011. Loved the positive feedback. I'll also upload a more elaborate version covering Hadoop mapreduce architecture as well soon. Most of the stuff covered in these slides can be found in Tom White's book as well (See the last slide)
Yahoo! Hadoop grid makes use of a managed service to get the data pulled into the clusters. However, when it comes to getting the data-out of the clusters, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts out file locations and underlying storage format of data for the users, along with several other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we will focus on how the ODBC/JDBC interface of HiveServer2 accomplished the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about the files, partitions and their location. We will also demo the data out capabilities, and go through other nice properties of the data out feature.
Presenter(s):
Sumeet Singh, Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations.
Whether an on-premise or cloud-based platform is used for storing, processing and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
r packagesdata analytics study material;
learn data analytics online;
data analytics courses;
courses for data analysis;
courses for data analytics;
online data analysis courses;
courses on data analysis;
data analytics classes;
data analysis training courses online;
courses in data analysis;
data analysis courses online;
data analytics training;
courses for data analyst;
data analysis online course;
data analysis certification;
data analysis courses;
data analysis classes;
online course data analysis;
learn data analysis online;
data analysis training;
python for data analysis course;
learn data analytics;
study data analytics;
how to learn data analytics;
data analysis course free;
statistical methods and data analysis;
big data analytics;
data analysis companies;
python data analysis course;
tools that can be used to analyse data;
data analysis consulting;
basic data analytics;
data analysis programs;
examples of data analysis tools;
big data analysis tools;
data analytics tools and techniques;
statistics for data analytics;
data analytics tools;
data analytics and big data;
data analytics big data;
data analysis software;
data analytics with excel;
website data analysis;
data analytics companies;
data analysis qualifications;
tools for data analytics;
data analysis tools;
qualitative data analysis software;
free data analytics;
data analysis website;
tools for analyzing data;
data analytics software;
free data analysis software;
tools for analysing data;
data mining book;
learn data analysis;
about data analytics;
statistical data analysis software;
it data analytics;
data analytics tutorial for beginners;
unstructured data analytics;
data analytics using excel;
dissertation data analysis;
sample of data analysis;
data analysis online;
data analytics;
tools of data analysis;
analytical tools for data analysis;
statistical tools to analyse data;
data analysis help;
data analysis education;
statistical technique for data analysis;
tools for data analysis;
how to learn data analysis;
data analytics tutorial;
excel data analytics;
data mining course;
data analysis software free;
big data and data analytics;
statistical analysis software;
tools to analyse data;
online data analysis;
data mining software;
data analytics statistics;
how to do data analytics;
statistical data analysis tools;
data analyst tools;
business data analysis;
tools and techniques of data analysis;
education data analysis;
advanced data analytics;
study data analysis;
spreadsheet data analysis;
learn data analysis in excel;
software for data analysis;
shared data warehouse;
what are data analysis tools;
data analytics and statistics;
data analyse;
analysis courses;
data analysis tools for research;
research data analysis tools;
big data analysis;
data mining programs;
applications of data analytics;
data analysis tools and techniques;
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
I have studied on Big Data analysis and found Hadoop is the best technology and most popular as well for it's distributed data processing approaches. I have gathered all possible information about various Hadoop distributions available in the market and tried to describe most important tools and their functionality in the Hadoop echosystems in this slide show. I have also tried to discuss about connectivity with language R interm of data analysis and visualization perspective. Hope you will be enjoying the whole!
If you are search Best Engineering college in India, Then you can trust RCE (Roorkee College of Engineering) services and facilities. They provide the best education facility, highly educated and experienced faculty, well furnished hostels for both boys and girls, top computerized Library, great placement opportunity and more at affordable fee.
Rapid Cluster Computing with Apache Spark 2016Zohar Elkayam
This is the presentation I used for Oracle Week 2016 session about Apache Spark.
In the agenda:
- The Big Data problem and possible solutions
- Basic Spark Core
- Working with RDDs
- Working with Spark Cluster and Parallel programming
- Spark modules: Spark SQL and Spark Streaming
- Performance and Troubleshooting
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemZohar Elkayam
Session from BGOUG I presented in June, 2016
Big data is one of the biggest buzzword in today's market. Terms like Hadoop, HDFS, YARN, Sqoop, and non-structured data has been scaring DBA's since 2010 - but where does the DBA team really fit in?
In this session, we will discuss everything database administrators and database developers needs to know about big data. We will demystify the Hadoop ecosystem and explore the different components. We will learn how HDFS and MapReduce are changing the data world, and where traditional databases fits into the grand scheme of things. We will also talk about why DBAs are the perfect candidates to transition into Big Data and Hadoop professionals and experts.
We Provide Hadoop training institute in Hyderabad and Bangalore with corporate training by 12+ Experience faculty.
Real-time industry experts from MNCs
Resume Preparation by expert Professionals
Lab exercises
Interview Preparation
Experts advice
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
11. Distributed Computing
• The key issues involved in this Solution:
• Hardware failure
• Combine the data after analysis
• Network Associated Problems
11
12. CAP Theorem
• CAP theorem (or Brewer’s theorem) is a set of
basic requirements that describes a distributed
system
• Consistency: all the server in the system will have
the same data
• Availability: all the server in the system will be
available and they will return all the data available
(also if they could be not consistent across the
system)
• Partition (tolerance): the system will continues to
operate as a whole despite arbitrary message loss
or failure of a part of the system
12
14. CAP Theorem (2)
14
According to
the theorem, a
distributed
system
CANNOT satisfy
all the three
requirements
at the SAME
time (“two out
of three”
concept).
15. Problems In Distributed
Computing
1. Hardware Failure:
• As soon as we start using many pieces of
hardware, the chance that one will fail is
fairly high.
2. Combine the data after analysis:
• Most analysis tasks need to be able to
combine the data in some way; data read
from one disk may need to be combined
with the data from any of the other 99
disks. 15
17. What’s Hadoop?
• “An open source software platform for
distributed storage and distributed
processing of very large data sets on
computer clusters built on commodity
hardware” - Hortonworks.
• Solving the first problem by Avoiding data
loss through replication
• redundant copies of the data are kept by
the system so that in the event of failure 17
18. What’s Hadoop? (cont’d)
• The second problem is solved by a simple
programming model called Mapreduce.
• Hadoop is also a highly popular open
source implementation of MapReduce
• a powerful tool designed for deep analysis
and transformation of very large data sets.
18
19. Hadoop…
• Where it comes from?
• The “legend” says that the name comes from
Doug Cutting (one of the founder of the
project) son’s toy elephant.
• So it is also the logo of the yellow smiling
elephant.
19
20. History
• [2002] Hadoop, created by Doug Cutting
(part of the Lucene project), starts as an
Open Source search engine for the Web.
• It has its origins in Apache Nutch, parts of
the Lucene project (full text search engine).
• [2003] Google publishes a paper
describing its own distributed file system,
also called GFS.
20
21. History (1)
• [2004] The first version of NDFS, Nutch
Distributed FS, implementing the Google’s
paper.
• [2004] Google publishes, another, paper
introducing the MapReduce algorithm
• [2005] The first version of MapReduce is
implemented in Nutch
21
22. History (2)
• [2005 (end)] Nutch’s MapReduce is
running on NDFS
• [2006 (Feb)] Nutch’s MapReduce and
NDFS became the core of a new Lucene’s
subproject.
• [2008] Yahoo launches the World’s largest
Hadoop PRODUCTION site
22
23. Key Features
23
• Automatic parallelization and distribution
• Fault-tolerance
• Data Locality
• Writing the Map and Reduce functions
only
• Single-threaded model
24. Why Hadoop ?
• Cheaper – scalable to Petabyte/Zetabyte
and more with commodity hardware
• Faster – Parallel Processing
• Better – Suitable for particular types of
‘Big Data’ applications
24
25. Right Data
• LOB (Line of Business) – not suitable
• Transactional Data
• Behavioral Data -- suitable
• Web usage
• Shopping behavior
• etc
25
29. HDFS
• The Hadoop Distributed File System
• For a developer point of view it looks like a
standard file system
• Runs on top of OS file system (extf3,…)
• Designed to store a very large amount of
data (petabytes and so on) and to solve
some problems that comes with DFS or NFS
• Provides fast and scalable access to the
data Stores data reliably 29
30. HDFS under the hood
• All the files loaded in Hadoop are split into
chunks, called blocks.
• Each block has a fixed size of 64Mb! (newer
version has default block size of 256 Mb).
MyData ~ 150Mb
HDFS
Blk_01
64Mb
Blk_03, 22Mb
Blk_02
64Mb
30
31. Hadoop cluster
• A Hadoop cluster consist in mainly two
modules:
• A way to store distributed data, the HDFS or
Hadoop Distributed File System (storage
layer)
• A way to process data, the MapReduce
(compute layer).
31
32. Name Node (Master)
• A dedicated node where all the metadata
of all the files (blocks) inside my system
are stored.
• It’s the directory manager of the HDFS
32
33. Data Node (Slave)
• A daemon (a service in the Windows
language) running on each cluster nodes.
• Responsible to store the blocks
33
34. Accessing Data
• To access a file, a client contact the
Namenode to retrieve the list of locations
for the blocks.
• With the locations the client contact the
Datanodes to read the data (possibly in
parallel).
34
35. Data Redundancy
• Hadoop replicates each block THREE times, as
it’s stored in the HDFS.
• The location of every blocks is managed by
the Namenode
• If a block is under-replicated (due to some
failures on a node), the Namenode is smart
enough to create another replica, until each
node has three replica inside the cluster
• E.g. if we have 100Tb of data to store in
Hadoop, we will need 300Tb of storage
space. 35
42. Hadoop 2.0
• Store all data in one place Interact with
data in multiple ways
42
43. Hadoop 2.x
• The new Hadoop has now FOUR modules
(instead of two)
• HadoopCommon: common utilities
supporting all the other modules
• HDFS: an evolution of the previous
distributed FS
• Hadoop YARN: a function for job scheduling
and cluster resource management
• Hadoop MapReduce: a YARN based system
for parallel processing of large data sets 43
44. Hadoop 2.x
• Hadoop v2, leveraging YARN, is aiming to
become the new OS for the data processing
44
45. Hadoop and real time
• Hadoop v2, using YARN, and Storm (a free
and open source distributed real time
computation system) can compute your
data in real time
• Some Hadoop distribution (like
Hortonworks) are working on an effortless
integration
45
49. Namenode availability
• If the Namenode fails ALL the cluster
becomes inaccessible
• In the early versions the Namenode was a
single point of failure
• Couple of solution are now available:
• the Namenode stores the data on the
network through NFS
• most production sites have two Namenode:
Active and Standby
49
50. Hadoop 3.x Features
• Support for Erasure Encoding in HDFS
• YARN Timeline Service v.2
• Shell Script Rewrite
• Shaded Client Jars
• Support for Opportunistic Containers
• MapReduce Task-Level Native
Optimization
50
51. Hadoop 3.x Features
• Support for More than 2 NameNodes
• Default Ports of Multiple Services have
been Changed
• Support for Filesystem Connector
• Intra-DataNode Balancer
• Reworked Daemon and Task Heap
Management
51
54. Hadoop Distributions
Open Source Commercial Cloud
Apache
Hadoop
Cloudera AWS
Hortonworks
Microsoft Azure
HDInsight
MapR DataProc
54
55. Hadoop related projects
• PIG: high level language fro analyzing large
data-sets. It’s working as a compiler that
produce M/R jobs
• HIVE: data warehouse software facilities
querying and managing large data-sets with a
SQL like language
• Hbase : a scalable, distributed database that
supports structured data storage for large
tables
• Cassandra: a scalable multi-master database 55
How to learn
Emphasize on programming with Java
Apology for document
Document is not quite complete
Some parts are irrelevant
Some just get added because of its interesting nature
Some are missing
Some are not part of this documentß
Student must lecture on undocumented details
Exmaple:
In the cloud, on an elastic first level system, the service should be “stateless” or at least “soft-state” (cached) and must always response to the query, even if the backend is down. So the system will be “A”, immediate responsive, and “P”, regardless a failure in the backend the system is responding to the requests
A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. The Hadoop Distributed Filesystem (HDFS), takes care of this problem.
Old: Apache Hadoop is a framework for running applications on large cluster built of commodity hardware.
This is the core of Hadoop!
The NameNode stores modifications to the file system as a log appended to a native file system file, edits. When a NameNode starts up, it reads HDFS state from an image file, fsimage, and then applies edits from the edits log file. It then writes new HDFS state to the fsimage and starts normal operation with an empty edits file