Big Data for the rest of us.
Dhaval Anjaria
Part 0: Before we Begin
It would be nice if you knew...
● Virtual Machines and virtualization.
● Linux command line and SSH.
● What a server, a rack and a datacenter look like
● SQL and Databases
● Multi-programming or Multi-threading
Virtual Machines
Linux CLI and SSH
● The command line is not an outdated interface.
● Helps in combining tasks into reusable scripts.
● Reduces load on the machine.
● SSH:
○ SSH is a secure mechanism to access a terminal on a remote machine.
○ SSH encrypts the connection and can authenticate with public/private key pairs,
which makes it far safer than unencrypted remote logins.
○ SSH can be used to remotely manage any machine as long as you can connect to it.
DCs and servers and racks! Oh my!
Databases
● Relational Database Management Systems use tables that have some
connection (or relation) to one another to store data.
● Each table has a primary key that uniquely identifies each row, and foreign keys,
which are columns that refer to rows in other tables.
● Data is organized according to this structure and is interacted with using SQL,
which is sort of a programming language for asking tables for specific data.
● e.g.: SELECT name, age, dob FROM users WHERE dob >= '1997-12-31' ORDER
BY name ASC;
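To try the query above without installing a database server, here is a minimal sketch using Python's built-in sqlite3 module. The cities table, the sample rows and the ISO-style date strings (YYYY-MM-DD, which sort correctly as text) are assumptions made up for illustration; they are not from the slides.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Hypothetical second table, so that users has something to refer to.
conn.execute("CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE users (
           id      INTEGER PRIMARY KEY,            -- primary key: one value per row
           name    TEXT NOT NULL,
           age     INTEGER,
           dob     TEXT,                           -- stored as 'YYYY-MM-DD'
           city_id INTEGER REFERENCES cities(id)   -- foreign key into cities
       )"""
)

conn.execute("INSERT INTO cities (id, name) VALUES (1, 'Mumbai')")
conn.executemany(
    "INSERT INTO users (name, age, dob, city_id) VALUES (?, ?, ?, ?)",
    [
        ("Meera", 25, "1999-04-12", 1),
        ("Arjun", 31, "1993-08-02", 1),
        ("Kiran", 22, "2002-01-30", 1),
    ],
)

# The query from the slide, with the date passed as a parameter.
rows = conn.execute(
    "SELECT name, age, dob FROM users WHERE dob >= ? ORDER BY name ASC",
    ("1997-12-31",),
).fetchall()
print(rows)
```

Only the two users born on or after the cut-off date come back, sorted by name.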
Tables and ER Diagrams
Multi-threading
Part 1: Big Data and Data Warehousing
How Big is Big Data?
● 100s of Terabytes of Data
● Billions of records
● Millions of concurrent users.
● Not always structured, organized data.
● Volume: Amount of Data
● Velocity: Data arrives continuously, often in real time, and results are expected
quickly.
● Variety: Big data draws from text, images, audio, sensor data, log data, so on
and so forth.
What do we do with this data?
● Data itself is not very useful. We need to convert it to information.
● This is where Data Warehousing comes in.
● Data mining is finding out particular things about the data.
● The warehouse contains telemetry, logging, monitoring, archiving and record-keeping
information.
● This is where Decision Support Systems (DSS) come in.
● You want interactive, understandable visualizations of data that support
management, so that they can decide what to do next.
● A DSS provides totals, aggregates, averages, trends and so on.
How do we organize this data?
● This is where facts and dimensions come in.
● A fact is the particular data or entity we want to look at, such as a sale.
● A fact has dimensions, which are other tables or entities related to that one
entity, such as the product, the store or the date.
● So we have a fact table, and dimension tables which are related to that table.
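As a rough sketch of the facts and dimensions described above, the snippet below builds one fact table and two dimension tables in an in-memory SQLite database and then asks for a total per city. All table names, column names and values (fact_sales, dim_store, dim_product and so on) are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: the "who / what / where" around each fact.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: one row per sale, pointing at its dimensions.
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        quantity   INTEGER,
        amount     REAL
    );

    INSERT INTO dim_product VALUES (1, 'Rice'), (2, 'Tea');
    INSERT INTO dim_store   VALUES (1, 'Mumbai'), (2, 'Pune');
    INSERT INTO fact_sales (product_id, store_id, quantity, amount) VALUES
        (1, 1, 2, 200.0),
        (2, 1, 1, 150.0),
        (1, 2, 5, 500.0);
    """
)

# Aggregate the facts along one dimension: total sales amount per city.
totals = conn.execute(
    """
    SELECT s.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY s.city
    ORDER BY s.city
    """
).fetchall()
print(totals)
```

Swapping dim_store for dim_product in the join gives totals per product instead, which is the point of keeping dimensions separate from the fact table.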
Facts and Dimensions
How do we get this data?
● Point of sale records, survey records, logs, geography, monitoring: all the little
transactions serve as sources for a Data Warehouse.
● This is where ETL comes in. ETL stands for Extract - Transform - Load.
● This means that first you get data from a source, such as ERP systems, CRM
systems, monitoring systems, server logs or traditional DBMSs. You then
transform it and store it in your data warehouse and data marts, according to
your organization.
ETL
● ETL stands for Extraction, Transformation and Loading.
● Extraction: First we get data from ERP, CRM, departments, logs, surveys and all
the various operational and day-to-day records from offices, departments and
stores.
● Transformation: We need to make it understandable, so we convert it into
whatever tables or formats we need.
● Load: Load is the actual loading and inserting into the database.
● E.g.: To store location data, you might need to convert it to strings or integers
and then load it into the proper tables in the database.
● Informatica, for example, is an ETL provider.
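The three ETL steps above can be sketched in a few lines of Python. This is a toy pipeline under stated assumptions, not how Informatica or any other ETL product works: the CSV text, the column names and the store_sales table are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system: location is one "lat,lon" string.
RAW_CSV = """store,location,amount
Mumbai Central,"19.0760,72.8777",1200
Pune East,"18.5204,73.8567",  850
"""

def extract(text):
    """Extract: read the raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: split the location into numbers and clean up the other fields."""
    cleaned = []
    for row in rows:
        lat, lon = (float(part) for part in row["location"].split(","))
        cleaned.append((row["store"].strip(), lat, lon, int(row["amount"].strip())))
    return cleaned

def load(conn, records):
    """Load: insert the cleaned records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS store_sales "
        "(store TEXT, lat REAL, lon REAL, amount INTEGER)"
    )
    conn.executemany("INSERT INTO store_sales VALUES (?, ?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT store, lat, lon, amount FROM store_sales ORDER BY store"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the E, T and L of the acronym, so each one can be swapped out when the source or the target changes.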
Data Marts and Data Warehouses
Hadoop
The problem we are trying to solve.
● We have a lot of data: terabytes and petabytes of it.
● Say we want to analyze this data. We have to sort through it, average it, find
the mode, do regression analysis and so forth.
● How do we do it effectively for millions or even billions of records?
● The answer is distribution.
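To make "distribution" concrete, here is a small sketch of averaging a large list by splitting it into chunks, letting each worker compute a partial sum and count, and combining the partial results at the end. Threads on one machine stand in for separate computers here, and the function names are hypothetical; a real cluster would ship the chunks over the network.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_and_count(chunk):
    """Work done by one worker: summarize only its own chunk."""
    return sum(chunk), len(chunk)

def distributed_mean(values, workers=4):
    """Average a non-empty list by combining per-chunk partial results."""
    size = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

records = list(range(1, 1_000_001))
print(distributed_mean(records))
```

Note that the workers return a sum and a count rather than an average: averages of chunks cannot simply be averaged again when the chunks differ in size.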
Bahubali statue scene.
● This scene is great. It shows us how awesome Bahubali is.
● Thing is, this sucks for the guy in charge of erecting the statue.
● It was luck that Bahubali showed up, otherwise the entire statue would’ve
fallen, killed a bunch of people and would’ve shattered into a million pieces
(probably).
● How do we do it without Bahubali?
● Bahubalis are expensive. Bahubalis are rare. If Bahubali falls sick, we have no
statue. If he gets hurt, no statue. If he’s busy trying to impress the chick, no
statue.
An alternative solution.
● MORE ROPES. MORE PEOPLE.
In the beginning, there was the GFS
● GFS (the Google File System) was Google’s distributed file system. Instead of
keeping one file on one PC, you combine the hard disks of many machines and use
them as one big hard disk.
● This is how Google initially gained its speed and performance - by using
commodity hardware and using it together.
● Google really evolved out of research on distributed computing. This is how
Google conquered the world.
What is Hadoop?
● Hadoop, simply put, is a system. It is a set of components working together to
achieve a certain goal.
● This means that Hadoop is an entire ecosystem of software and programs
working together.
● It is more a large system than a single piece of software.
What Hadoop is made of.
● Core components are MapReduce and HDFS.
● MapReduce is a programming paradigm, popularized by Google, that distributes
workloads over several computers.
● HDFS is a distributed file system, which means that it splits up files across
multiple machines across your network.
The Hadoop Distributed File System.
● Hadoop uses a file system abstraction built in Java to store files.
● HDFS distributes files over several computers, splitting them into blocks and
replicating those blocks as needed.
● HDFS uses a master/slave architecture to monitor and keep track of the
various nodes working together.
● HDFS is not accessed through ordinary operating system file calls. One has to
use the Hadoop APIs or the Hadoop command-line tools to access these
files.
Advantages of the HDFS.
● It is a scalable, fault-tolerant, write-once-read-many file system that
MapReduce uses to store and retrieve data effectively.
● When processing, Hadoop MapReduce and HDFS work together so that the
data that is being currently worked on is stored on the same node that is
processing that data.
● Provides automatic redundancy (replicated copies of the data), monitoring,
diagnostics and so on, so that one sysadmin can monitor 1000s of nodes.
A little deeper
● HDFS configuration consists of two kinds of nodes: The NameNode and the
DataNode.
● The NameNode contains the metadata, which means it holds data about
which nodes store which data. It is generally the master node that also
manages the other nodes (i.e. starting, stopping and so on).
● The DataNodes (ideally) contain only the data and do not contain metadata
information about the other nodes.
● DataNodes register with the NameNode, after which the NameNode assigns them
actual data.
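The NameNode and DataNode roles above can be illustrated with a toy in-memory model. This is only a sketch: the class and method names are hypothetical, the round-robin placement is a simplification and not Hadoop's actual placement policy, and the 4-byte block size is absurdly small so the example stays readable (real HDFS blocks are many megabytes).

```python
from itertools import cycle

BLOCK_SIZE = 4    # bytes per block in this toy; real HDFS blocks are far larger
REPLICATION = 3   # number of copies kept of every block

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

class ToyNameNode:
    """Holds only metadata: which DataNodes store which block."""

    def __init__(self, datanodes, replication=REPLICATION):
        self.datanodes = {name: {} for name in datanodes}  # stand-ins for real machines
        self.block_map = {}                                 # block id -> list of node names
        self.replication = min(replication, len(datanodes))
        self._next = cycle(datanodes)                       # simple round-robin placement

    def write(self, filename, data):
        for index, block in enumerate(split_into_blocks(data)):
            block_id = f"{filename}#{index}"
            targets = []
            while len(targets) < self.replication:
                node = next(self._next)
                if node not in targets:
                    targets.append(node)
            for node in targets:
                self.datanodes[node][block_id] = block      # the DataNode stores the bytes
            self.block_map[block_id] = targets              # the NameNode remembers where

    def read(self, filename, failed=()):
        """Reassemble a file, skipping replicas on nodes listed as failed."""
        parts = []
        index = 0
        while f"{filename}#{index}" in self.block_map:
            block_id = f"{filename}#{index}"
            alive = [n for n in self.block_map[block_id] if n not in failed]
            parts.append(self.datanodes[alive[0]][block_id])
            index += 1
        return b"".join(parts)

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.write("log.txt", b"hello hadoop")
print(nn.block_map)
print(nn.read("log.txt", failed={"dn1", "dn2"}))
```

Because every block lives on three of the four nodes, the file can still be read with two nodes marked as failed, which is the fault tolerance the slides refer to.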
How data is distributed and managed.
MapReduce
● MapReduce is a programming paradigm that has its roots in functional
programming.
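● Map and reduce started out as functional programming ideas. Google described
its large-scale MapReduce system in a paper published in 2004.
● MapReduce uses two basic ideas: map, which turns each input record into
key-value pairs, and reduce, which combines all the values for one key. Together
they split data and load into smaller chunks for processing.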
Example of a MapReduce.
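A common first example is counting words. The sketch below runs the map, shuffle and reduce steps in a single Python process, so it shows the shape of the idea rather than Hadoop's actual API; the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

lines = ["big data is big", "data is everywhere"]
counts = word_count(lines)
print(counts)
```

On a cluster, each line (or block of lines) would be mapped on a different machine, and each key would be reduced on whichever machine the shuffle sends it to.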
How MapReduce works
A little more detail
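● In current Hadoop versions, MapReduce jobs run on YARN, which is Hadoop’s
resource manager and job scheduler.
● In Hadoop 1, the master node ran a JobTracker while the worker nodes ran
TaskTrackers, which tracked and managed the individual map and reduce tasks.
Under YARN, a ResourceManager, per-node NodeManagers and a per-job
ApplicationMaster take over these roles.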
● Hadoop’s MapReduce works with HDFS so that the data that a job is working
on is on the same node (computer) as the MapReduce process currently
working on that data.
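The "move the work to the data" idea can be sketched as a tiny scheduler that prefers a node already holding the block and only falls back to another node when no local slot is free. This is a simplified greedy illustration with hypothetical names, not Hadoop's real scheduler.

```python
def assign_tasks(block_locations, free_slots):
    """Assign each block to a node, preferring nodes that already store it.

    Returns {block_id: (node, is_local)}.
    """
    slots = dict(free_slots)  # copy, so the caller's dict is not modified
    assignments = {}
    for block_id, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free task slots left")
        chosen = candidates[0]
        slots[chosen] -= 1
        assignments[block_id] = (chosen, chosen in holders)
    return assignments

# Hypothetical cluster state: which nodes hold which block, and free slots per node.
block_locations = {"b0": ["dn1", "dn2"], "b1": ["dn1"], "b2": ["dn3"]}
free_slots = {"dn1": 1, "dn2": 1, "dn3": 0, "dn4": 1}
plan = assign_tasks(block_locations, free_slots)
print(plan)
```

Here only b0 runs locally: b1 loses its local slot to b0, and b2's node has no free slot at all. A smarter scheduler would have placed b0 on dn2, which is why real systems put more effort into this decision.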
Overall Structure of Hadoop.
What the funny names mean.
● Hive is a data warehouse system with an SQL-like query language, created so
that you didn’t have to write MapReduce for everything. Its queries run against
HDFS and MapReduce.
● Pig is a higher-level scripting language for data processing that also leverages
MapReduce and HDFS.
● Impala is a low-latency SQL query engine, a faster alternative to Hive and Pig
for interactive queries.
● Solr provides super-fast searching.
● Spark is a newer project that is an alternative to MapReduce. It processes data
in memory and provides real-time streaming and machine learning.
Hadoop in the Enterprise
Hadoop: What is it good for?
An example in advertising.
● Advertising is a great use case for Hadoop since modern market research
requires multiple sources like social media, customer habits, trends,
location-based habits, and a multitude of other factors.
● This is a lot of unstructured data that you need to get meaningful information
from.
● Management needs low-latency analytics systems to make decisions faster.
● You need low turnaround time and need to keep up with exponential growth.
Both of these tasks are well suited for a Hadoop solution.
An example in Retail.
● Star Bazaar for example figured out that letting the customers pick their own
vegetables was cheaper than pre-packaging them.
● Sears Holdings needed the following query to be answered: how many items
do we have in all our stores above a certain price?
● Traditionally, this would take hours or even days.
● 15 billion records, 400 GB of data, “how many items were selling above
$29,999.0?”
● 28 records were returned. Hadoop did it in 53 seconds.
Conclusion.
● Big data needs immediate processing. You achieve this by distributing the
load across multiple machines.
● With enough machines, you can search through billions of records in
seconds.
● Hadoop provides software that does this easily and scales up effectively,
which means you can simply add more machines on the fly.
Thank you.

Big data for the rest of us with hadoop

  • 1. Big Data for the rest of us. Dhaval Anjaria
  • 2. Part 0: Before we Begin
  • 3. It would be nice if you knew... ● Virtual Machines and virtualization. ● Linux command line and SSH. ● What a Server, a rack and a Datacenter looks like ● SQL and Databases ● Multi-programing or Multi-threading
  • 5. Linux CLI and SSH ● It is not an outdated format. ● Helps in combining tasks into reusable scripts. ● Reduces load on the machine. ● SSH: ○ SSH is a safe, secure mechanism to access a terminal on a remote machine. ○ SSH uses public/private key infrastructure which makes it completely secure. ○ SSH can be used to remotely manage any machine as long as you can connect to it.
  • 6. DCs and servers and racks! Oh my!
  • 7. Databases ● Relational Database Management Systems use tables that have some connection (or relation) to one another to store data. ● Each table has a primary key that identifies each unique row and foriegn keys which are columns connected to the other tables. ● Data is organized according to this structure and is interacted with using SQL, which is sort of a programming language for asking tables for specific data. ● eg: SELECT name, age, dob FROM users where DOB >= ‘31-12-1997’ ORDER BY NAME ASC;
  • 8. Tables and ER Diagrams
  • 10. Part 1: Big Data and Data Warehousing
  • 11. How Big is Big Data? ● 100s of Terabytes of Data ● Billions of records ● Millions of concurrent users. ● Not always structured, organized data. ● Volume: Amount of Data ● Velocity: It is in real-time, so results are expected immediately, no matter what. ● Variety: Big data draws from text, images, audio, sensor data, log data, so on and so forth.
  • 12. What do we do with this data? ● Data itself is not very useful. We need to convert it to information. ● This is where Data Warehousing comes in. ● Data mining is finding out particular things about the data. ● Contains telemetry, logging, monitoring, archiving, monitoring, record-keeping information. ● This is where DSS comes in. ● You want interactive, understandable visualizations of data so that support management, so that they can decide what to do next. ● It provides totals, aggregates, averages, trends and so on.
  • 13. How do we organize this data? ● Facts and Dimensions come in. ● So we have a fact, which is the particular data or entity we want to look at. ● This fact will have dimensions, which can be other tables or entities related to that one entity. ● So we have a Fact table, and dimensions which are related to that table.
  • 15. How do we get this data? ● Point of sale records, survey records, logs, geography, monitoring, all the little transactions serve as sources for a Data Warehouse. ● This is where ETL comes in. ETL stands for Extract - Transform - Load. ● This means that first you get data from a source. This can be sources like ERP systems, CRM systems, monitoring systems, server logs, traditional DBMS, etc and then store it into your data warehouse and data marts, according to your organization
  • 16. ETL ● ETL stands for Extraction, Transformation and Loading. ● First we get data from ERP, CRM, Departments, logs, surveys and all the various operational and day-to-day records from various offices and departments and stores or whatever you have. ● Transformation: We need to make it understandable. So we convert into whatever tables or formats we need.. ● Load: Load is the actual loading and inserting into the database. ● Eg: To store location data, you might need to convert it to strings or integers and then loading it into the proper tables in the database. ● Informatica, for example, is an ETL provider.
  • 17. Data Marts and Data Warehouses
  • 19. The problem we are trying to solve. ● We have lot of Data. Terabytes and Petabytes of Data. ● Say we want to analyze this data, we have to sort through it, we have to average it and find out the mode and do regression analysis and so forth. ● How do we do it effectively for millions or even billions of records? ● The answer is distribution.
  • 20. Bahubali statue scene. ● This scene is great. It shows us how awesome Bahubali is. ● Thing is, this sucks for the guy in charge of erecting the statue. ● It was luck that Bahubali showed up, otherwise the entire statue would’ve fallen, killed a bunch of people and would’ve shattered into a million pieces (probably). ● How do we do it without Bahubali? ● Bahubalis are expensive. Bahubalis are rare. If Bahubali falls sick, we have no statue. If he gets hurt, no statue. If he’s busy trying to impress the chick, no statue.
  • 21. An alternative solution. ● MORE ROPES. MORE PEOPLE.
  • 22. In the beginning, there was the GFS ● GFS was Google’s Distributed File System. This meant that instead of having one file on one PC. You combine all the hard disks and use them as one big hard disk. ● This is how Google initially gained its speed and performance - by using commodity hardware and using it together. ● Google really evolved out of research on distributed computing. This is how Google conquered the world.
  • 23. What is Hadoop? ● Hadoop, simply put, is a system. It is a set of components working together to achieve a certain goal. ● Which means that Hadoop is an entire ecosystem of software and programs working together. ● It is more like a large System rather than just any piece of software.
  • 24. What Hadoop is made of. ● Core components are MapReduce and HDFS. ● MapReduce is a programming paradigm commercialized and made viable by Google that distributes workloads over several computers. ● HDFS is a distributed file system, which means that it splits up files across multiple machines across your network.
  • 25. The Hadoop Distributed File System. ● Hadoop uses an file system abstraction built in Java to store files. ● HDFS distributes files over several computers replicating them and splitting them according to what it needs. ● HDFS uses a master slave architecture to monitor and keep track of the various nodes working together. ● The interface to HDFS is not a simple system call or something like that. One has to use the Hadoop APIs or specific commands written in java to access these files.
  • 26. Advantages of the HDFS. ● It is a scalable, fault-tolerant, write-once-read-many file system that works with MapReduce to distribute and retrieve data effectively. ● When processing, Hadoop MapReduce and HDFS work together so that the data currently being worked on is stored on the same node that is processing it. ● Provides automatic redundancy (replicated copies), monitoring, diagnostics and so on, so that one sysadmin can monitor 1000s of nodes.
  • 27. A little deeper ● An HDFS configuration consists of two kinds of nodes: the NameNode and the DataNodes. ● The NameNode holds the metadata, that is, the record of which nodes store which data. It is generally the master node that also manages the other nodes (i.e. starting, stopping and so on). ● The DataNodes (ideally) contain only the data and no metadata about the other nodes. ● They register with the NameNode, after which the NameNode assigns them actual data.
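The register-then-assign relationship between the NameNode and the DataNodes can be sketched as a toy Python class. Everything here is hypothetical: the class `NameNode`, its methods, the block IDs and the "least-loaded node first" rule are assumptions for illustration and do not mirror the real Hadoop API.

```python
class NameNode:
    """Toy master node: holds only metadata, never the data itself."""

    def __init__(self):
        self.datanodes = set()      # names of registered DataNodes
        self.block_locations = {}   # block id -> list of DataNode names

    def register(self, datanode):
        """A DataNode announces itself before it can receive data."""
        self.datanodes.add(datanode)

    def assign_block(self, block_id, replication=2):
        """Pick the least-loaded registered nodes to store a block."""
        if len(self.datanodes) < replication:
            raise RuntimeError("not enough registered DataNodes")
        load = {name: 0 for name in self.datanodes}
        for nodes in self.block_locations.values():
            for name in nodes:
                load[name] += 1
        # Sort by load, then by name so the choice is deterministic.
        chosen = sorted(load, key=lambda name: (load[name], name))[:replication]
        self.block_locations[block_id] = chosen
        return chosen

    def locate(self, block_id):
        """Answer the metadata question: which nodes hold this block?"""
        return self.block_locations[block_id]


namenode = NameNode()
for name in ("dn1", "dn2", "dn3"):
    namenode.register(name)

namenode.assign_block("blk_1")
namenode.assign_block("blk_2")
print(namenode.locate("blk_1"))
print(namenode.locate("blk_2"))
```

A client in this sketch never asks the NameNode for the data itself, only for where the data lives.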
  • 28. How data is distributed and managed.
  • 29. MapReduce ● MapReduce is a programming paradigm that has its roots in functional programming. ● Initially only a research topic, it was put to large-scale practical use by Google in the early 2000s. ● MapReduce uses two basic ideas: map, which applies a function to each piece of the input independently, and reduce, which combines the results. Together they break the data and the load into smaller chunks for processing.
  • 30. Example of a MapReduce.
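One minimal sketch of the idea, assuming a word-count task (a common teaching example, chosen here for illustration), is shown in plain Python. The function names `map_phase`, `shuffle` and `reduce_phase` and the sample documents are hypothetical; on a real cluster each phase would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data is everywhere"]

# Each document could be mapped on a different machine.
mapped = [pair for doc in documents for pair in map_phase(doc)]

# Each key could be reduced on a different machine.
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The map step never looks at more than one document and the reduce step never looks at more than one word, which is what makes the work easy to spread over many nodes.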
  • 32. A little more detail ● MapReduce in current Hadoop runs on YARN, which is Hadoop’s resource manager and job scheduler. ● In the original, pre-YARN design, the master node runs a JobTracker while the worker nodes each run a TaskTracker; these track and manage the individual map and reduce tasks. Under YARN, these roles are taken over by the ResourceManager and the NodeManagers. ● Hadoop’s MapReduce works with HDFS so that the data that a job is working on is on the same node (computer) as the MapReduce process currently working on that data.
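The "run the task where the data already is" idea can be sketched as a small scheduling function. This is an assumption-laden toy, not Hadoop's scheduler: the function `pick_node`, the node names and the slot counts are all hypothetical.

```python
def pick_node(block_id, block_locations, free_slots):
    """Choose a node for a task that needs `block_id`.

    Returns (node, is_local). Nodes that already store the block are
    preferred; otherwise any node with a free slot is used and the
    block would have to travel over the network.
    """
    for node in block_locations.get(block_id, []):
        if free_slots.get(node, 0) > 0:
            return node, True
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False
    return None, False


# Hypothetical cluster state.
block_locations = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn3"]}
free_slots = {"dn1": 0, "dn2": 1, "dn3": 0, "dn4": 2}

print(pick_node("blk_1", block_locations, free_slots))
print(pick_node("blk_2", block_locations, free_slots))
```

In this sketch the task for `blk_1` lands on a node that already holds the block, while the task for `blk_2` falls back to a remote node because the only holder is busy.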
  • 34. What the funny names mean. ● Hive is a data warehouse system with a SQL-like query language, created so that you don’t have to write MapReduce for everything; its queries run against HDFS and MapReduce. ● Pig is a high-level scripting language for data processing that also leverages MapReduce and HDFS. ● Impala is a low-latency SQL query engine, a faster alternative to Hive and Pig. ● Solr provides super-fast searching. ● Spark is a newer project that is sort of an alternative to MapReduce: it processes data in memory and provides real-time streaming and machine learning.
  • 35. Hadoop in the Enterprise
  • 36. Hadoop: What is it good for?
  • 37. An example in advertising. ● Advertising is a great use case for Hadoop since modern market research draws on multiple sources like social media, customer habits, trends, location-based habits, and a multitude of other factors. ● This is a lot of unstructured data that you need to get meaningful information from. ● Management needs low-latency analytics systems to make decisions faster. ● You need low turnaround time and need to keep up with exponential growth. Both of these needs are well suited to a Hadoop solution.
  • 38. An example in Retail. ● Star Bazaar, for example, figured out that letting the customers pick their own vegetables was cheaper than pre-packaging them. ● Sears Holdings needed the following query to be answered: how many items do we have in all our stores above a certain price? ● This would traditionally take hours or even days. ● 15 billion records, 400 GB of data, “how many items were selling above $29,999.0?” ● 28 records were returned. Hadoop did it in 53 seconds.
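A query of this shape splits naturally into a map step (count matches in each chunk of records) and a reduce step (add the counts up). The sketch below assumes made-up prices and three in-memory lists standing in for data stored on three machines; the function name `count_above` and all values are hypothetical, and nothing here reflects the actual Sears system.

```python
from functools import reduce


def count_above(partition, threshold):
    """Map step: count the prices in one partition above the threshold."""
    return sum(1 for price in partition if price > threshold)


# Hypothetical price records, pre-split the way HDFS would split a file.
partitions = [
    [19.99, 35000.0, 120.0],
    [45000.0, 5.0],
    [29999.0, 31000.0],
]
threshold = 29999.0

# Each partition could be counted on the machine that stores it.
partial_counts = [count_above(partition, threshold) for partition in partitions]

# Reduce step: combine the small partial results into one answer.
total = reduce(lambda a, b: a + b, partial_counts, 0)
print(partial_counts, total)
```

Only the tiny partial counts have to cross the network, never the records themselves, which is why adding machines shortens the query time.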
  • 39. Conclusion. ● Big data needs immediate processing. You achieve this by distributing the load across multiple machines. ● With enough machines, you can search through billions of records in seconds. ● Hadoop provides software that does this easily and scales up effectively, which means you can simply add more machines on the fly.