This presentation provides an overview of big data concepts and Hadoop technologies. It discusses what big data is and why it is important for businesses to gain insights from massive data. The key Hadoop technologies explained include HDFS for distributed storage, MapReduce for distributed processing, and various tools that run on top of Hadoop like Hive, Pig, HBase, HCatalog, ZooKeeper and Sqoop. Popular Hadoop SQL databases like Impala, Presto and Stinger are also compared in terms of their performance and capabilities. The document discusses options for deploying Hadoop on-premise or in the cloud and how to integrate Microsoft BI tools with Hadoop for big data analytics.
(EMC World 2012): Apache Hadoop is now enterprise-ready. This session reviews the features and roadmap of Hadoop. We will review some of the key capabilities of GPHD 1.x and our plans for 2012.
This is the basis for some talks I've given at the Microsoft Technology Center, the Chicago Mercantile Exchange, and local user groups over the past two years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
What is Hadoop?

Your data are stored in a relational database on your desktop computer, and that computer has no problem handling the load. Then your company starts growing very quickly, and the data grows to 10GB, then 100GB, and you start to reach the limits of your desktop computer. So you scale up by investing in a larger computer, and you are then OK for a few more months. Then your data grows to 10TB, and then 100TB, and you are fast approaching the limits of that computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your management wants to derive information from both the relational data and the unstructured data, and wants this information as soon as possible. What should you do? Hadoop may be the answer!

Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is the popular open source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets. Hadoop enables you to explore complex data, using custom analyses tailored to your information and questions. Hadoop is the system that allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and the execution of Map/Reduce routines to run on the data in that cluster. Hadoop has its own filesystem, which replicates data to multiple nodes to ensure that if one node holding data goes down, there are at least two other nodes from which to retrieve that piece of information. This protects data availability from node failure, something which is critical when there are many nodes in a cluster (think RAID at a server level).

Hadoop is optimized to handle massive quantities of data, which could be structured, unstructured, or semi-structured, using commodity hardware, that is, relatively inexpensive computers. This massively parallel processing is done with great performance. However, it is a batch operation handling massive quantities of data, so the response time is not immediate. As of Hadoop version 0.20.2, updates are not possible, but appends will be possible starting in version 0.21. Hadoop replicates its data across different computers, so that if one goes down, the data are processed on one of the replicated computers. Hadoop is not suitable for OnLine Transaction Processing workloads, where structured data are randomly accessed, as in a relational database. Nor is it suitable for OnLine Analytical Processing or Decision Support System workloads, where structured data are sequentially accessed to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements OnLine Transaction Processing and OnLine Analytical Processing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Edition), by Uwe Printz
Talk held at the Java User Group on 05.09.2013 in Novi Sad, Serbia
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
I have studied Big Data analysis and found Hadoop to be the best and most popular technology for its distributed data processing approach. I have gathered all available information about the various Hadoop distributions on the market and tried to describe the most important tools and their functionality in the Hadoop ecosystem in this slide show. I have also tried to discuss connectivity with the R language from a data analysis and visualization perspective. Hope you enjoy the whole thing!
Hadoop is an open source software framework that supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally written by Google on the MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project, built and used by a global community of contributors. Hadoop was developed by Doug Cutting and Michael J. Cafarella. And don't overlook the charming yellow elephant you see, which is named after Doug's son's toy elephant!
The topics covered in presentation are:
1. Big Data Learning Path
2. Big Data Introduction
3. Hadoop and its Eco-system
4. Hadoop Architecture
5. Next Step on how to set up Hadoop
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop components" will provide you with detailed knowledge about the top Hadoop Components and it will help you understand the different categories of Hadoop Components. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
These slides cover the very basics of Hadoop architecture, in particular HDFS. This was my presentation at the first Delhi Hadoop User Group (DHUG) meetup, held in Gurgaon on 10th September 2011. Loved the positive feedback. I'll also upload a more elaborate version covering the Hadoop MapReduce architecture soon. Most of the stuff covered in these slides can be found in Tom White's book as well (see the last slide).
A presentation on Hadoop for scientific researchers given at Universitat Rovira i Virgili in Catalonia, Spain in October 2010. http://etseq.urv.cat/seminaris/seminars/3/
Hadoop is emerging as the preferred solution for big data analytics across unstructured data. Using real-world examples, learn how to achieve a competitive advantage by finding effective ways of analyzing new sources of unstructured and machine-generated data.
This presentation gives a high-level overview of Hadoop and its ecosystem. It starts with why Hadoop came into existence, then covers how Hadoop is being used, the components of Hadoop and its ecosystem, who the Hadoop and ETL/BI vendors are, and how Hadoop is typically implemented. It also covers a few examples to provide a kick-start to anyone interested in learning and practicing MapReduce, Hadoop, and its ecosystem products.
Modern applications, often called “big-data” analysis, require us to manage immense amounts of data quickly. To deal with applications such as these, a new software stack has evolved.
In this session, you get an overview of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service. We'll cover how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also discuss new features, architecture best practices, and share how customers are using Amazon Redshift for their Big Data workloads.
Overview of Big data, Hadoop and Microsoft BI - version 1, by Thanh Nguyen
Big Data and advanced analytics are critical topics for executives today. But many still aren't sure how to turn that promise into value. This presentation provides an overview of 16 examples and use cases that lay out the different ways companies have approached the issue and found value: everything from pricing flexibility to customer preference management to credit risk analysis to fraud protection and discount targeting. For the latest on Big Data & Advanced Analytics: http://mckinseyonmarketingandsales.com/topics/big-data
Overview of big data & hadoop version 1 - Tony Nguyen, by Thanh Nguyen
Overview of Big data, Hadoop and Microsoft BI - version1
Big Data and Hadoop are emerging topics in data warehousing for many executives, BI practices, and technologists today. However, many people still aren't sure how Big Data and an existing data warehouse can be married to turn that promise into value. This presentation provides an overview of Big Data technology and how Big Data can fit into the current BI/data warehousing context.
http://www.quantumit.com.au
http://www.evisional.com
Big Data is a collection of large and complex data sets that cannot be processed using regular database management tools or processing applications. A lot of challenges such as capture, curation, storage, search, sharing, analysis, and visualization can be encountered while handling Big Data. On the other hand the Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Big Data certification is one of the most recognized credentials of today.
For more details Click http://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
This is a presentation on Hadoop basics. Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.
Asserting that Big Data is vital to business is an understatement. Organizations have generated more and more data for years, but struggle to use it effectively. Clearly Big Data has more important uses than ensuring compliance with regulatory requirements. In addition, data is being generated with greater velocity, due to the advent of new pervasive devices (e.g., smartphones, tablets, etc.), social Web sites (e.g., Facebook, Twitter, LinkedIn, etc.) and other sources like GPS, Google Maps, heat/pressure sensors, etc.
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi..., by Cognizant
A guide to using Apache Hadoop as your open source big data platform of choice, including the vendors that make various Hadoop flavors, related open source tools, Hadoop capabilities and suitable applications.
Enough talking about Big Data and Hadoop; let's see how Hadoop works in action. We will locate a real dataset, ingest it into our cluster, connect it to a database, apply some queries and data transformations to it, save our results, and show them via a BI tool.
1. An Overview of Big data & Hadoop
Prepared & presented by
Tony Nguyen
July 2014
2. Presentation outline
This presentation gives Big Data concepts and an overview of different Big Data technologies.
Understand different tools and use the right tools for DW and ETL.
How does the current BI/DW fit into the Big Data context?
How do Microsoft BI and Hadoop get married?
3. What is big data?
Refers to any collection of data sets so large and complex, e.g. hundreds of petabytes, that they cannot be processed with regular tools.
4. Why does Big Data matter?
• 2 billion internet users in the world today
• 7.3 billion active cell phones in 2014
• 7TB of data is processed by Twitter every day
• 500TB of data is processed by Facebook every day
• With massive quantities of data, businesses need fast, reliable, deeper data insight
6. What is Hadoop?
Refers to an ecosystem that includes a large-scale distributed filesystem used to store and process big data across multiple storage servers.
Hadoop technologies include MapReduce and the Hadoop Distributed File System (HDFS).
7. Who are the major Hadoop vendors?
IBM InfoSphere BigInsights: IBM packs Hadoop with its products, including Text Analytics, Social Data Analytics Accelerator, Big SQL, and Big R.
Cloudera: packs Hadoop core components with its well-known analytic SQL product named Impala and provides enterprise support. Current Cloudera Hadoop versions include CDH 4.7 and CDH 5.1.
Hortonworks: a company formed by Yahoo and Benchmark Capital; Hortonworks makes Hadoop ready for the enterprise with the latest version, HDP 2.1.
Microsoft: contributes HDInsight as Hadoop on the Windows platform.
8. HDFS
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework.
It is designed to run across low-cost commodity hardware.
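For illustration, here is a minimal sketch of an application talking to HDFS through the Java FileSystem API. The NameNode address and the file path are assumptions made up for this example; on a real cluster the address comes from core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address for illustration only.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS replicates its blocks across DataNodes.
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("Hello, HDFS!\n");
        }

        // Read it back through the same API.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}
```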
9. MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
From Hadoop version 2, YARN (Yet Another Resource Negotiator) was introduced to manage cluster resources and job scheduling.
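The canonical illustration of the model is word count, sketched below along the lines of the standard Apache Hadoop example: the map phase emits a (word, 1) pair for every word, and the reduce phase sums the counts per word. Input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```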
11. Core components on top of Hadoop
1. Hive (Facebook)
2. Pig (Yahoo)
3. HBase
4. HCatalog
5. Knox
6. ZooKeeper
7. Sqoop
12. Pig
1. Originally developed by Yahoo
2. Best used for ETL on large data sets
3. Offers a dataflow scripting language called Pig Latin, a high-level language designed to remove the complexities of coding MapReduce applications
4. Pig converts its operators into MapReduce code
5. Instead of needing Java programming skills and an understanding of the MapReduce coding infrastructure, people with little programming skill can simply invoke SORT or FILTER operators without having to code a MapReduce application to accomplish those tasks
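As a sketch of how little code a Pig dataflow needs, the snippet below embeds Pig Latin in Java via PigServer. The input file, its field layout, and the output path are assumptions made up for illustration; each registered statement (LOAD, FILTER, ORDER) is compiled into MapReduce code, and STORE triggers execution.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for testing; ExecType.MAPREDUCE runs on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin: a FILTER and an ORDER instead of a hand-written MapReduce job.
        pig.registerQuery("logs = LOAD 'access_log.tsv' "
                + "AS (userid:chararray, url:chararray, bytes:long);");
        pig.registerQuery("big = FILTER logs BY bytes > 1000000;");
        pig.registerQuery("sorted = ORDER big BY bytes DESC;");

        // STORE triggers execution of the whole dataflow.
        pig.store("sorted", "big_requests");
    }
}
```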
13. Hive
Originally developed by Facebook in 2007.
Hive is a data warehouse built on top of the Hadoop file system (HDFS), allowing developers to use SQL-like scripts (called Hive SQL, or HQL) to create databases and tables.
Hive translates the SQL-like scripts into MapReduce jobs to store and process large data sets.
The learning curve is short, as BI developers use familiar SQL-like scripts.
14. Hive (Cont'd)
UPDATE or DELETE of a record isn't allowed in Hive, but INSERT INTO is acceptable.
A way to work around this limitation is to use partitions: if you're getting different batches of ids separately, you could redesign your table so that it is partitioned by id, and then you would be able to easily drop partitions for the ids you want to get rid of.
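To make the two Hive slides concrete, here is a minimal JDBC sketch against HiveServer2 that creates a partitioned table with HQL, queries it, and then applies the partition-drop workaround described above. The host, credentials, and table names are assumptions made up for illustration, not part of the original deck.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 commonly listens on port 10000; host and user are assumptions.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "hive", "");
        Statement stmt = con.createStatement();

        // HQL looks like SQL, but Hive compiles it into MapReduce jobs.
        stmt.execute("CREATE TABLE IF NOT EXISTS events_by_id "
                + "(payload STRING) PARTITIONED BY (id STRING)");

        // Each batch of ids lands in its own partition (staging_events is assumed).
        stmt.execute("INSERT INTO TABLE events_by_id PARTITION (id='batch42') "
                + "SELECT payload FROM staging_events");

        // Aggregate query over the warehouse.
        ResultSet rs = stmt.executeQuery(
                "SELECT id, COUNT(*) FROM events_by_id GROUP BY id");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }

        // No DELETE here: dropping the partition removes the whole batch.
        stmt.execute("ALTER TABLE events_by_id DROP IF EXISTS PARTITION (id='batch42')");
        con.close();
    }
}
```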
15. HBase
HBase is a column-oriented database management system that runs on top of HDFS.
The database is modelled after Google's BigTable technology. HBase was created for hosting very large tables with billions of rows and millions of columns.
An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database.
HBase provides random, real-time access to your Big Data.
It does not support a structured query language like SQL.
It is referred to as a NoSQL technology (NoSQL means Not Only SQL), as HBase is not intended to replace your traditional RDBMS.
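To make "random, real-time access without SQL" concrete, here is a minimal sketch using the HBase Java client API. The ZooKeeper quorum address, table name, and column family are assumptions for illustration, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper quorum; HBase clients locate the cluster through it.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // No SQL: rows are addressed by key, values by family:qualifier.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Ada"));
            table.put(put);

            // Random, real-time read of a single row by key.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```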
16. HCatalog
1. HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on the grid
2. Frees the user from having to know where the data is stored, with the table abstraction
3. Enables notifications of data availability
4. Provides visibility for data cleaning and archiving tools
17. Knox
A system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster.
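As a sketch of what "a single point of access" means in practice, the snippet below calls the WebHDFS REST API through a Knox gateway over HTTPS with basic authentication. The gateway host and port, the topology name ("default"), the credentials, and the listed path are all assumptions for illustration, and the gateway's TLS certificate is assumed to be trusted by the JVM.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class KnoxExample {
    public static void main(String[] args) throws Exception {
        // One authenticated HTTPS endpoint fronts the whole cluster.
        URL url = new URL(
            "https://knox-host:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Basic auth against the gateway, not against each Hadoop service.
        String auth = Base64.getEncoder()
                .encodeToString("guest:guest-password".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON listing of /tmp from WebHDFS
            }
        }
    }
}
```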
21. Three popular open source Hadoop-based SQL databases
1. Impala (Cloudera)
2. Stinger (Hortonworks) – (aka Hive 11, Hive 12, Hive 13, or Hive-on-Tez)
3. Presto (Facebook)
22. Impala
1. Developed by Cloudera in 2012
2. A SQL query engine that runs natively in Apache Hadoop
3. Queries data using SELECT, JOIN, and aggregate functions – in real time
4. Accesses HDFS directly and uses MPP computation instead of MapReduce, thereby providing near real-time data access
5. The entire process happens in memory, eliminating the latency of the disk IO that occurs extensively during a MapReduce job
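As a hedged sketch of point 3: Impala speaks the HiveServer2 wire protocol, so a HiveServer2-compatible JDBC driver can issue interactive SQL against an Impala daemon (commonly on port 21050). The host, the unsecured ;auth=noSasl setting, and the pageviews table are assumptions for illustration; secured clusters need different connection settings.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaExample {
    public static void main(String[] args) throws Exception {
        // Assumed Impala daemon address; ;auth=noSasl suits unsecured test
        // clusters only.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://impala-host:21050/default;auth=noSasl");
        Statement stmt = con.createStatement();

        // The same SELECT/aggregate SQL as Hive, but executed by Impala's
        // in-memory MPP engine rather than compiled into MapReduce jobs.
        ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) AS hits FROM pageviews "
                + "GROUP BY url ORDER BY hits DESC LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}
```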
23. MPP vs MapReduce
Both are distributed data processing systems, but they differ as follows:
Hardware: MPP is used on expensive, specialized hardware tuned for CPU, storage, and network performance; MapReduce is deployed to clusters of commodity servers that in turn use commodity disks.
Speed: MPP is faster; MapReduce is slower.
Computation: MPP computes in memory; MapReduce relies on disk I/O.
Query language: MPP is queried with SQL; MapReduce is driven by Java code.
Style: MPP uses declarative queries; MapReduce uses imperative code.
Productivity: SQL is easier and more productive; MapReduce is more difficult for IT professionals.
24. Stinger
1. Refers to new versions of Hive (versions 0.11 to 0.13) that overcome the performance barrier of MapReduce computation
2. More SQL compliance for Hive SQL
http://hortonworks.com/labs/stinger/
26. Presto
1. In response to Cloudera Impala, Facebook introduced Presto in 2012
2. Presto is similar in approach to Impala in that it is designed to provide an interactive experience whilst still using your existing datasets stored in Hadoop
It provides:
JDBC drivers
ANSI-SQL syntax support (presumably ANSI-92)
A set of 'connectors' used to read data from existing data sources; connectors include HDFS, Hive, and Cassandra
Interop with the Hive metastore for schema sharing
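A minimal sketch of querying Presto through its JDBC driver, assuming the Hive connector is configured as catalog "hive". The coordinator address, schema, and pageviews table are made up for illustration; the window function matches the RANK/LEAD/LAG support noted in the comparison below.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class PrestoExample {
    public static void main(String[] args) throws Exception {
        // Presto requires a user name on the connection; the coordinator
        // address and catalog/schema path are assumptions.
        Properties props = new Properties();
        props.setProperty("user", "analyst");
        Connection con = DriverManager.getConnection(
                "jdbc:presto://presto-coordinator:8080/hive/default", props);
        Statement stmt = con.createStatement();

        // ANSI SQL, including analytic functions such as RANK.
        ResultSet rs = stmt.executeQuery(
                "SELECT url, hits, RANK() OVER (ORDER BY hits DESC) AS rnk "
                + "FROM (SELECT url, COUNT(*) AS hits FROM pageviews GROUP BY url)");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2)
                    + "\t" + rs.getLong(3));
        }
        con.close();
    }
}
```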
28. Comparison of Hive, Impala, Presto, and Stinger
Year: Hive 2007; Impala 2012; Presto and Stinger still in development
Original developer: Hive by Facebook; Impala by Cloudera; Presto by Facebook; Stinger by Hortonworks
Main purpose: Hive is a data warehouse; Impala enables analysts and data scientists to directly interact with any data stored in Hadoop and offloads self-service business intelligence to Hadoop; Presto and Stinger serve as RDBMS-style engines
Computation approach: Hive uses MapReduce; Impala, Presto, and Stinger use a massively parallel processing (MPP) architecture
Performance: Hive low; Impala, Presto, and Stinger fast
Latency: Hive high; Impala, Presto, and Stinger low
Language: Hive uses a SQL-like script; Impala offers ANSI-92 SQL support with user-defined functions (UDFs); Presto offers SQL including RANK, LEAD, and LAG; Stinger uses a SQL-like script
Interfaces: Hive has CLI, web, ODBC, and JDBC; Impala has ODBC, JDBC, impala-shell, and a web interface; Presto and Stinger have JDBC
High availability: Hadoop 2.0/CDH4 provides HA at the HDFS level for Hive, Presto, and Stinger; Impala yes
Replication: Hive yes; Impala supported between two CDH 5 clusters; Presto and Stinger unknown
29. Hive pros and cons
Advantages:
It's been around 5 years; you could say it is a mature and proven solution.
Runs on the proven MapReduce framework.
Good support for user-defined functions.
It can be mapped to HBase and other systems easily.
Disadvantages:
Since it uses MapReduce, it carries all the drawbacks MapReduce has, such as an expensive shuffle phase and huge IO operations.
Hive still does not support multiple reducers, which makes queries like GROUP BY and ORDER BY a lot slower.
A lot slower compared to its competitors.
Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
30. Impala pros and cons
Advantages:
Lightning speed; promises near real-time ad hoc query processing.
The computation happens in memory, which removes an enormous amount of latency and disk IO.
The latest version supports UDFs.
Open source, Apache licensed.
Disadvantages:
No fault tolerance for running queries: if a query fails on a node, the query has to be reissued; it can't resume from where it failed.
Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
31. Presto pros and cons
Advantages:
Lightning fast; promises near real-time interactive querying.
Used extensively at Facebook, so it is proven and stable.
Open source, and there has been strong momentum behind it ever since it was open-sourced.
It also uses a distributed query processing engine, so it eliminates the latency and disk IO issues of traditional MapReduce.
Well documented; perhaps this is the first open-source software from Facebook that got a dedicated website from day 1.
Disadvantages:
It's a newborn baby; need to wait and watch, since there are some interesting active developments going on.
As of now it supports only Hive-managed tables. Though the website claims one can also query HBase, the feature is still under development.
Still no UDF support yet; this is the most requested feature to be added.
Reference: http://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/
36. Comments on Impala
Among Impala, Hive, and Presto, it seems that Impala is the most mature SQL-on-Hadoop engine.
Impala appears to be the winner in terms of performance and maturity.
38. Combining Hadoop and SQL Server tools
Both Hadoop and SQL Server have strengths and weaknesses.
Combining Hadoop and SQL Server tools lets each technology's strengths compensate for the other's weaknesses.
39. SQL Server vs SQL on Hadoop
SQL Server: enforces data quality and consistency better (unique indexes, keys, and foreign keys), but there is a scalability limit.
SQL on Hadoop: lacks data quality enforcement, but is better at scaling and processing massive data.
40. Deployment options
Hadoop on premise
Hadoop in the cloud:
1. Infrastructure as a Service (IaaS) – providers of IaaS offer computers, physical or (more often) virtual machines
2. Platform as a Service (PaaS) – includes the operating system, programming language execution environment, database, and web server
3. Software as a Service (SaaS) – provides access to application software and databases
45. Use the right ETL tools
SSIS – existing skills in the organisation, transformation is needed, performance tuning is important
Pig – use for very large data sets, to take advantage of the scalability of Hadoop, when IT staff are comfortable learning a new language
Sqoop – little need to transform the data, easy to use, when IT staff aren't comfortable with SSIS or Pig; loads SQL tables directly into Hadoop
46. SQL Server Parallel Data Warehouse – a high-performance and expensive solution
SQL Server Parallel Data Warehouse is the MPP edition of SQL Server.
Unlike the Standard, Enterprise, or Data Center editions, PDW is actually a hardware and software bundle rather than just a piece of software. Microsoft calls it a database "appliance".
It isn't a substitute for SSIS, SSAS, and SSRS. It's Microsoft's answer for customers needing to process tens or hundreds of terabytes who want the ability to scale out large workloads across multiple servers, large storage arrays, and many processors.
It includes:
◦ Microsoft PolyBase
◦ Microsoft Analytics Platform System (APS)
◦ The ability to run on top of Hadoop
49. References
Microsoft Big Data Solutions, Wiley, February 2014
Microsoft SQL Server 2012 with Hadoop, Debarchan Sarkar, Packt Publishing, 2013
Cloudera.com
Hortonworks.com
Hadoop.apache.org
Microsoft.com/bigdata
Impala.io
Prestodb.io
Hive.apache.org