This is a gentle introduction to Hadoop and Big Data in general. It shows how MapReduce and HDFS work, along with various ideas from data warehousing and Big Data.
It also shows how scalability works and how cloud technologies can be leveraged to achieve it.
3. It would be nice if you knew...
● Virtual Machines and virtualization.
● Linux command line and SSH.
● What a server, a rack and a datacenter look like
● SQL and databases
● Multiprogramming or multithreading
5. Linux CLI and SSH
● The command line is not an outdated way of working.
● It helps you combine tasks into reusable scripts.
● It puts less load on the machine than a graphical interface.
● SSH:
○ SSH is a safe, secure mechanism to access a terminal on a remote machine.
○ SSH uses public/private key cryptography to authenticate users and encrypt the connection.
○ SSH can be used to remotely manage any machine as long as you can connect to it.
7. Databases
● Relational Database Management Systems use tables that have some connection (or relation) to one another to store data.
● Each table has a primary key that identifies each unique row, and foreign keys, which are columns connected to other tables.
● Data is organized according to this structure and is interacted with using SQL, a declarative language for asking tables for specific data (a sketch of running such a query from Java follows).
● e.g.: SELECT name, age, dob FROM users WHERE dob >= '1997-12-31' ORDER BY name ASC;
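As a rough illustration, here is how an application might run the query above from Java via JDBC. This is a minimal sketch: the JDBC URL, credentials and the users table are assumptions, and you would need the matching JDBC driver on the classpath.

    import java.sql.*;

    public class UserQuery {
        public static void main(String[] args) throws SQLException {
            // Hypothetical connection details; substitute your own database.
            String url = "jdbc:postgresql://localhost:5432/example";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(
                         "SELECT name, age, dob FROM users WHERE dob >= ? ORDER BY name ASC")) {
                stmt.setDate(1, Date.valueOf("1997-12-31")); // parameterized, not string-built
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name") + ", " + rs.getInt("age"));
                    }
                }
            }
        }
    }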
11. How Big is Big Data?
● 100s of Terabytes of Data
● Billions of records
● Millions of concurrent users.
● Not always structured, organized data.
● Volume: the sheer amount of data.
● Velocity: data arrives continuously, often in real time, and results are expected almost immediately.
● Variety: big data draws from text, images, audio, sensor data, log data, and so on and so forth.
12. What do we do with this data?
● Data itself is not very useful. We need to convert it into information.
● This is where Data Warehousing comes in.
● A data warehouse holds telemetry, logging, monitoring, archiving and record-keeping information.
● Data mining is finding out particular facts and patterns in that data.
● This is where Decision Support Systems (DSS) come in.
● You want interactive, understandable visualizations of the data that support management, so that they can decide what to do next.
● A DSS provides totals, aggregates, averages, trends and so on.
13. How do we organize this data?
● This is where facts and dimensions come in.
● A fact is the particular data point or entity we want to look at.
● This fact has dimensions, which can be other tables or entities related to that one entity.
● So we have a fact table at the center, with dimension tables related to it (a star schema; see the sketch below).
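To make the star-schema idea concrete, here is a minimal Java sketch (Java 16+ for records). The table and field names are invented for illustration: a Sale fact row carries foreign keys into Product and Store dimensions.

    import java.util.Map;

    public class StarSchemaSketch {
        // Dimension tables: descriptive attributes, keyed by their primary keys.
        record Product(int productId, String name, String category) {}
        record Store(int storeId, String city) {}

        // Fact table row: the measurable event, holding foreign keys into the dimensions.
        record Sale(int productId, int storeId, String date, double amount) {}

        public static void main(String[] args) {
            Map<Integer, Product> products = Map.of(1, new Product(1, "Rice 5kg", "Grocery"));
            Map<Integer, Store> stores = Map.of(10, new Store(10, "Mumbai"));
            Sale sale = new Sale(1, 10, "2016-01-15", 349.0);

            // Analysis joins the fact back to its dimensions via the keys.
            System.out.printf("%s sold at %s for %.2f%n",
                    products.get(sale.productId()).name(),
                    stores.get(sale.storeId()).city(),
                    sale.amount());
        }
    }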
15. How do we get this data?
● Point-of-sale records, survey records, logs, geography, monitoring: all the little transactions serve as sources for a data warehouse.
● This is where ETL comes in. ETL stands for Extract, Transform, Load.
● This means that you first get data from sources such as ERP systems, CRM systems, monitoring systems, server logs and traditional DBMSs, and then store it in your data warehouse and data marts, organized to suit your organization.
16. ETL
● ETL stands for Extraction, Transformation and Loading.
● Extract: first we get data from ERP, CRM, departments, logs, surveys and all the various operational, day-to-day records from whatever offices, departments and stores you have.
● Transform: we need to make it understandable, so we convert it into whatever tables or formats we need.
● Load: the actual loading and inserting into the database.
● E.g.: to store location data, you might need to convert it to strings or integers and then load it into the proper tables in the database (see the sketch after this list).
● Informatica, for example, is an ETL provider.
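Here is a tiny, self-contained Java sketch of the three steps (Java 16+). The survey lines are made up; in a real pipeline the extract step would read an ERP/CRM export and the load step would insert into warehouse tables rather than print.

    import java.util.List;
    import java.util.Locale;

    public class EtlSketch {
        record CleanRecord(String city, int age) {}

        public static void main(String[] args) {
            // Extract: raw lines as they might arrive from a survey export.
            List<String> raw = List.of("  mumbai ,34", "DELHI,29", "Pune , 41");

            // Transform: normalize the city names and parse the ages into typed fields.
            List<CleanRecord> clean = raw.stream()
                    .map(line -> line.split(","))
                    .map(f -> new CleanRecord(capitalize(f[0].trim()), Integer.parseInt(f[1].trim())))
                    .toList();

            // Load: just print here; a real pipeline would write to the warehouse.
            clean.forEach(r -> System.out.println(r.city() + " -> " + r.age()));
        }

        static String capitalize(String s) {
            String lower = s.toLowerCase(Locale.ROOT);
            return Character.toUpperCase(lower.charAt(0)) + lower.substring(1);
        }
    }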
19. The problem we are trying to solve.
● We have a lot of data: terabytes and petabytes of it.
● Say we want to analyze this data: we have to sort through it, average it, find the mode, run regression analysis and so forth.
● How do we do it effectively for millions or even billions of records?
● The answer is distribution.
20. Bahubali statue scene.
● This scene is great. It shows us how awesome Bahubali is.
● Thing is, this sucks for the guy in charge of erecting the statue.
● It was luck that Bahubali showed up; otherwise the entire statue would’ve fallen, killed a bunch of people and shattered into a million pieces (probably).
● How do we do it without Bahubali?
● Bahubalis are expensive. Bahubalis are rare. If Bahubali falls sick, we have no
statue. If he gets hurt, no statue. If he’s busy trying to impress the chick, no
statue.
22. In the beginning, there was the GFS
● GFS was Google’s distributed file system. Instead of keeping each file on one PC, you combine the hard disks of many machines and use them as one big hard disk.
● This is how Google initially gained its speed and performance: by using commodity hardware together.
● Google really evolved out of research on distributed computing. This is how
Google conquered the world.
23. What is Hadoop?
● Hadoop, simply put, is a system: a set of components working together to achieve a certain goal.
● Which means that Hadoop is an entire ecosystem of software and programs working together.
● It is a large system rather than just a single piece of software.
24. What Hadoop is made of.
● Core components are MapReduce and HDFS.
● MapReduce is a programming paradigm, popularized and proven at scale by Google, that distributes workloads over several computers.
● HDFS is a distributed file system, which means it splits files up across multiple machines on your network.
25. The Hadoop Distributed File System.
● Hadoop uses a file system abstraction built in Java to store files.
● HDFS distributes files over several computers replicating them and splitting
them according to what it needs.
● HDFS uses a master/slave architecture to monitor and keep track of the various nodes working together.
● The interface to HDFS is not a simple system call: you access these files through the Hadoop Java APIs or Hadoop’s own file-system commands (a minimal Java example follows).
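For example, reading a file through Hadoop’s Java FileSystem API looks roughly like this. The NameNode address and file path are placeholders, and the hadoop-client libraries must be on the classpath.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder address
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataInputStream in = fs.open(new Path("/data/example.txt"));
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }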
26. Advantages of the HDFS.
● It is a scalable, fault-tolerant, write-once-read-many file system that works with MapReduce to distribute and retrieve data effectively.
● When processing, Hadoop MapReduce and HDFS work together so that the
data that is being currently worked on is stored on the same node that is
processing that data.
● It provides automatic redundancy (replication), monitoring, diagnostics and so on, so that one sysadmin can manage thousands of nodes.
27. A little deeper
● An HDFS cluster consists of two kinds of nodes: the NameNode and the DataNodes.
● The NameNode holds the metadata, i.e. data about which nodes store which data. It is generally the master node that also manages the other nodes (starting, stopping and so on).
● The DataNodes (ideally) contain only the data and hold no metadata about the other nodes.
● They register with the NameNode, which then assigns them actual data to store.
29. MapReduce
● MapReduce is a programming paradigm that has its roots in functional
programming.
● Initially a research idea, it was popularized by Google, which published its MapReduce paper in the early 2000s and proved the model at scale.
● MapReduce rests on two basic ideas, map and reduce, which split the data and workload into smaller chunks for processing (a minimal illustration follows).
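The functional roots are easy to see on a single machine. This Java streams snippet is only an analogy, not Hadoop code: map transforms each element independently, and reduce folds the results into one value.

    import java.util.List;

    public class MapReduceIdea {
        public static void main(String[] args) {
            List<String> words = List.of("big", "data", "hadoop");

            // map: word -> its length; reduce: fold the lengths into a sum.
            int totalLetters = words.stream()
                    .map(String::length)
                    .reduce(0, Integer::sum);

            System.out.println(totalLetters); // 13
        }
    }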
32. A little more detail
● MapReduce in Hadoop now runs on YARN, Hadoop’s current resource-management and job-scheduling layer.
● In classic MapReduce, the master node ran a JobTracker while the worker nodes ran TaskTrackers, which tracked and managed the individual map and reduce jobs; YARN replaces these with a ResourceManager and per-node NodeManagers.
● Hadoop’s MapReduce works with HDFS so that the data a job is working on sits on the same node (computer) as the MapReduce process currently working on that data (the classic WordCount job is sketched below).
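The canonical example is WordCount, sketched here with Hadoop’s MapReduce Java API (mapper and reducer only; job configuration and submission are omitted for brevity).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: emit (word, 1) for every word in this node's slice of the input.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts for each word gathered from all the mappers.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }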
34. What the funny names mean.
● Hive gives you a SQL-like query language (HiveQL) that compiles down to MapReduce over HDFS, so you don’t have to write MapReduce code for everything.
● Pig is a high-level data-flow language (Pig Latin) that also leverages MapReduce and HDFS.
● Impala is a low-latency SQL query engine, an alternative to Hive and Pig when you need answers faster.
● Solr provides super-fast full-text search.
● Spark is a newer project that processes data in memory as an alternative to MapReduce, and adds real-time streaming and machine learning.
37. An example in advertising.
● Advertising is a great use case for Hadoop since modern market research
requires multiple sources like social media, customer habits, trends,
location-based habits, and a multitude of other factors.
● This is a lot of unstructured data that you need to get meaningful information
from.
● Management needs low-latency analytics systems to make decisions faster.
● You need low turnaround time and need to keep up with exponential growth.
Both of these tasks are well suited for a Hadoop solution.
38. An example in Retail.
● Star Bazaar, for example, figured out that letting customers pick their own vegetables was cheaper than pre-packaging them.
● Sears Holdings needed the following query answered: how many items do we have in all our stores above a certain price?
● Traditionally, this would take hours or even days.
● 15 billion records, 400 GB of data: "how many items were selling above $29,999?"
● 28 records were returned. Hadoop did it in 53 seconds (a single-machine sketch of the same filter-and-count follows).
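At its heart that query is a filter-and-count. This single-machine Java sketch with a parallel stream shows the shape of it on synthetic prices (not Sears data); Hadoop does the same thing, but spreads the scan across machines instead of cores.

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.stream.LongStream;

    public class PriceFilter {
        public static void main(String[] args) {
            // Synthetic prices standing in for billions of sales records.
            long matches = LongStream.range(0, 10_000_000)
                    .parallel() // local cores here; Hadoop uses whole machines
                    .map(i -> ThreadLocalRandom.current().nextLong(40_000))
                    .filter(price -> price > 29_999)
                    .count();
            System.out.println("items above $29,999: " + matches);
        }
    }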
39. Conclusion.
● Big data needs immediate processing. You achieve this by distributing the
load across multiple machines.
● With enough machines, you can search through billions of records in
seconds.
● Hadoop provides software that does this easily and scales out effectively, which means you can simply add more machines on the fly.