Big Data Introduction
Agenda
• Current Scenario/Trends in IT
• Big Data
– Batch eco system
– NoSQL eco system
– Visualization
• Case Studies for Big Data
– Enterprise Data Warehouse
– Customer Analytics
Current Scenario
Enterprise
applications
Operational
Decision
Support
Enterprise applications can be broadly categorized into
Operational and Decision support systems.
Current Scenario – Architecture
(Typical Enterprise Application)
Client
(Browser)
Client
(Browser)
Client
(Browser)
App Server
App Server
Database
Current Scenario - Architecture
• Recent trends
– Standardization and consolidation of hardware
(servers, storage, network) etc., to cut down the
costs
– Storage is physically separated from servers and
connected with high speed fiber optics
Current Scenario - Architecture
Database
Server
Database
Server
Database
Server
Network Switch Network Switch Storage Cluster
*Typical database architecture in an enterprise
Current Scenario - Architecture
• Databases
– Databases are clustered (Oracle – RAC)
• High availability
• Fault tolerance
• Load balancing
• Scalable (though not linearly)
– Common network storage
• File abstraction – file can be of any size
• Fault tolerance (using RAID)
Current Scenario - Architecture
• Almost all these applications follow a similar n-tier architecture
– Core applications (operational)
– EAI (Enterprise Application Integration)
– CRM
– ERP
– DW/BI tools like Informatica, Cognos, Business Objects etc.
• However, there are exceptions – legacy (mainframe based)
applications which use a closed architecture
Current Scenario - Architecture
Application
Servers
Database Servers
Storage
Servers
*Bird's-eye view – after standardization and
consolidation using cloud architecture
Current Scenario - Challenges
• Almost all operational systems use relational databases (RDBMS such as
Oracle).
– RDBMS were originally designed for operational and transactional workloads.
• Not linearly scalable.
– Transactions
– Data integrity
• Expensive
• Predefined schema
• Data processing does not happen where data is stored (storage layer)
– Some processing happens at the database server level (SQL)
– Some processing happens at the application server level (Java/.NET)
– Some processing happens at the client/browser level (JavaScript)
Current Scenario – Use case
(E-Mail Campaigning)
App
Server(s)
Mail
Server(s)
Database
Client
Client
Client
Current Scenario – Use case
(E-Mail Campaigning)
• Customer (E-Mail recipient) data needs to be
stored in real time
• Customer data can be in hundreds of millions
(if not billions)
• For every campaign, e-mails have to be pushed
to all the customers (batch and ad-hoc)
• Customers have to be uniquely identified to
avoid sending multiple coupons to the same
recipient (batch and periodic)
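The deduplication requirement above can be sketched in a few lines. This is an illustrative simplification, not the deck's implementation: real systems match customers on more than a normalized e-mail address.

```python
# Minimal sketch (illustrative only): deduplicate campaign recipients by
# normalizing e-mail addresses so each customer gets exactly one coupon.

def normalize(email: str) -> str:
    """Lowercase and strip whitespace so trivially different strings match."""
    return email.strip().lower()

def unique_recipients(emails):
    """Return recipients in first-seen order, one entry per normalized address."""
    seen = set()
    result = []
    for e in emails:
        key = normalize(e)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result

emails = ["Alice@Example.com", "alice@example.com ", "bob@example.com"]
print(unique_recipients(emails))  # ['alice@example.com', 'bob@example.com']
```

At hundreds of millions of customers this set no longer fits on one machine, which is exactly the scalability problem the later slides address with distributed batch processing.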
Current Scenario – Use case
(E-Mail Campaigning)
• Challenges
– Small client vs. Big client
• Scalability issues can be significant
– Standard client vs. Premium client
– Infrastructure
• Databases, application servers, or e-mail servers can be the
bottleneck
– Code development and deployment
– Standardization
*Keep these in mind – I will explain how they can be
resolved using the Big Data eco system
Big Data
• Evolution of Big Data
• Understanding characteristics of Big Data
• Batch, operational and analytics in Big Data
eco system
• Types, Technologies or tools, Techniques and
Talent
Evolution of Big Data
• GFS (Google File System)
• Google Map Reduce
• Google Big Table
Understanding characteristics of Big
Data
• Volume
• Variety
• Velocity
Batch, operational and analytics in Big
Data eco system
• Batch – Hadoop eco system
– Map reduce
– Hive/Pig
– Sqoop
• Operational (but not transactional) – NoSQL eco system
– Cassandra
– HBase
– MongoDB
• Analytics and visualization
– Sentiment analysis
– Statistical analysis
– Machine Learning and Natural Language Processing
Big Data eco system – Advantages
• Distributed storage
– Fault tolerance (RAID is replaced by replication)
• Distributed computing/processing
– Data locality (code goes to data)
• Scalability (almost linear)
• Low cost hardware (commodity)
• Low licensing costs
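The first advantage above – replication replacing RAID – can be illustrated with a toy placement function. Everything here (node names, the round-robin policy) is a made-up simplification; real systems such as HDFS use rack-aware placement.

```python
# Illustrative sketch: fault tolerance via replication instead of RAID.
# Each block is written to `replication` distinct nodes, so the loss of
# any single node never loses data.

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication)]
    return placement

placement = place_replicas(["b0", "b1"], ["n1", "n2", "n3", "n4"])
print(placement)  # {'b0': ['n1', 'n2', 'n3'], 'b1': ['n2', 'n3', 'n4']}
```

Data locality is the mirror image of this table: because the scheduler knows which nodes hold a block, it can send the processing code to one of those nodes instead of moving the data.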
Hadoop eco system
• Evolution of Hadoop eco system
• Use cases that can be addressed using Hadoop
eco system
• Hadoop eco system tools/landscape
Evolution of Hadoop eco system
• GFS to HDFS
• Google Map Reduce to Hadoop Map Reduce
• Big Table to HBase
Use cases that can be addressed using
Hadoop eco system
• ETL
• Real time reporting
• Batch reporting
• Operational but not transactional
Hadoop eco system tools/landscape
• Operational and real time data integration
– HBase
• ETL
– Map reduce, Hive/Pig, Sqoop etc
• Reporting
– Hive (Batch)
– Impala/Presto (Real time)
• Analytics API
– Map reduce
– Other frameworks
• Miscellaneous/complementary tools
– ZooKeeper (co-ordination service for masters)
– Oozie (workflow/scheduler)
– Chef/Puppet (automation for administrators)
– Vendor specific management tools (Cloudera Manager, Hortonworks Ambari etc.)
NoSQL eco system
• Evolution of NoSQL eco system
• Use cases that can be addressed using NoSQL
eco system
• NoSQL eco system tools/landscape
Evolution of NoSQL eco system
• Google Big Table
• Amazon DynamoDB
• Apache HBase
• Apache Cassandra
• MongoDB
Use cases that can be addressed using
NoSQL eco system
• Operational but not transactional
• Complements conventional RDBMS systems
• NoSQL is generally not a substitute for
transactional systems.
• Facebook Messenger is implemented using
HBase
NoSQL eco system tools/landscape
• NoSQL Tools
– Apache HBase
– Apache Cassandra
– MongoDB
• Miscellaneous/complementary tools
– ZooKeeper (co-ordination service for high
availability of masters)
– Vendor specific DevOps tools
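What distinguishes these stores from an RDBMS is the data model. The toy sketch below imitates the wide-column model of HBase/Cassandra; it is not a client API, and the row key and column names are invented for the example.

```python
# Toy sketch of the wide-column data model used by HBase/Cassandra
# (illustrative only): rows are addressed by a key, and each row holds
# column -> value pairs that need not follow a fixed schema.

table = {}  # row_key -> {column: value}

def put(row_key, column, value):
    """Write one cell; new columns can appear at any time."""
    table.setdefault(row_key, {})[column] = value

def get(row_key, column=None):
    """Read a whole row, or a single cell if a column is given."""
    row = table.get(row_key, {})
    return row if column is None else row.get(column)

put("cust#42", "profile:email", "alice@example.com")
put("cust#42", "campaign:last_sent", "2014-01-15")
print(get("cust#42", "profile:email"))  # alice@example.com
```

Key-addressed reads and writes like these are fast and scale out, which is why the deck calls these systems operational; what they deliberately give up are the multi-row transactions of an RDBMS.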
Analytics and Visualization
• Evolution of analytics and visualization tools
• Use cases that can be addressed
– Statistical analysis
– Machine learning and Natural language processing
– Conventional Reporting
• Eco system tools/landscape
– Datameer
– Tableau or any BI tool
– R (In memory statistical analysis tool)
Use Case – E-Mail Campaigning
• Role of NoSQL
– Operational
• Role of Hadoop
– Decision support
*Both NoSQL and Hadoop can be installed on the same
servers.
Current Scenario – Use case
(E-Mail Campaigning)
App
Server(s)
Mail
Server(s)
Database
Client
Client
Client
Current Scenario – Use case
(E-Mail Campaigning)
• Customer (E-Mail recipient) data needs to be
stored in real time
• Customer data can be in hundreds of millions
(if not billions)
• For every campaign, e-mails have to be pushed
to all the customers (batch and ad-hoc)
• Customers have to be uniquely identified to
avoid sending multiple coupons to the same
recipient (batch and periodic)
Current Scenario – Use case
(E-Mail Campaigning)
• Challenges
– Small client vs. Big client
• Scalability issues can be significant
– Standard client vs. Premium client
– Infrastructure
• Databases, application servers, or e-mail servers can be the
bottleneck
– Code development and deployment
– Standardization
*Keep these in mind – I will explain how they can be
resolved using the Big Data eco system
Use Case (E-Mail Campaigning)
Big Data eco system
Client
Client
Client
Node 1: Storage + Processing
Node 2: Storage + Processing
Use Case (E-Mail Campaigning)
Big Data eco system
• Storage
– Distributed storage (e.g. HDFS, CFS, GFS)
• Processing
– Operational (HBase, Cassandra)
• Data storage is operational – for example, customer data might have to be
stored in real time
– Batch (Map Reduce, Hive/Pig)
• E-Mail campaigning is batch
• Map Reduce can be integrated with e-mail notification to push the
campaign.
• Customer validation can be done in batch
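The batch side of the slide above can be sketched as a map-style job: each node applies the same function to the customer records it holds locally. This is a hypothetical in-process simulation; record fields and the campaign text are made up.

```python
# Hypothetical sketch of the batch campaign: a map-style function turns
# each stored customer record into an outgoing message. On a real
# cluster each partition would be processed on the node that stores it.

def build_message(customer, campaign):
    """The 'mapper': one customer record -> one campaign message."""
    return {"to": customer["email"],
            "body": f"Hi {customer['name']}, {campaign}"}

def run_campaign(partitions, campaign):
    """Apply the mapper to every record in every partition (batch)."""
    return [build_message(c, campaign)
            for part in partitions
            for c in part]

partitions = [[{"name": "Alice", "email": "a@x.com"}],
              [{"name": "Bob", "email": "b@x.com"}]]
messages = run_campaign(partitions, "20% off this week!")
print(len(messages))  # 2
```

Because the mapper sees every record exactly once, the same pass can also do the batch customer validation the slide mentions.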
Use Case (LinkedIn)
• Most of the frames (page modules) on linkedin.com are
implemented using Big Data eco system tools
• Advantages
– Low cost to implement an idea (endorsements)
– No impact on existing applications
– Both operational (actual endorsement) and batch
(consolidated e-mail) are done on same servers
– Distributed and scalable
Use Case – EDW
(Current Architecture)
OLTP
Closed
Main Frames
XML
External apps
Data
Warehouse
Data Integration
(ETL/Real Time)
ODS
Source(s)
EDW/ODS
Visualization/
Reporting
Reporting
Decision
Support
Use Case – EDW
(Current Architecture)
• The Enterprise Data Warehouse is built for enterprise reporting for a select
audience in executive management; hence the user base viewing the reports
is typically in the tens or hundreds
• Data Integration
– ODS (Operational Data Store)
• Sources – Disparate
• Real time – Tools/custom (GoldenGate, SharePlex etc.)
• Batch – Tools/custom
• Uses – Compliance, data lineage, reports etc
– Enterprise Data Warehouse
• Sources – ODS or other sources
• ETL – Tools/custom (Informatica, Ab Initio, Talend)
• Reporting/Visualization
– ODS (Compliance related reporting)
– Enterprise Data Warehouse
– Tools (Cognos, Business Objects, Microstrategy, Tableau etc)
Use Case – EDW
(Big Data eco system)
OLTP
Closed
Main Frames
XML
External apps
Source(s)
Visualization/
Reporting
Reporting
Decision
Support
Node
Node
Node
Hadoop Cluster
(EDW/ODS)
ETL
Real
Time/Batch
(No ETL)
Reporting
Database
Hadoop eco system
*Diagram – Hadoop eco system:
– Hadoop Core Components: Distributed File System (HDFS), Map Reduce
– Hadoop Components (Map Reduce based): Hive, Pig, Flume, Sqoop, Oozie, Mahout
– Non Map Reduce: Impala, Presto, HBase
Use Case – EDW
(Big Data eco system)
• ODS and EDW can be shared on the same Hadoop cluster
• Real time/batch data integration
– Flume (to get data from web logs)
– Use HBase layer
• ETL
– Should leverage Hadoop Map Reduce capabilities
– Sqoop – to get data from relational databases
– Hive/Pig – To process/transform data as per reporting requirements
• Reporting/Visualization
– Reporting can be done either directly from Hadoop or separate
reporting database
Use Case – EDW
(Big Data eco system)
• Pros over traditional EDW
– Low cost and consolidated hardware
– Low licensing costs
– Open source tools
– Facilitate advanced analytics
• Cons over traditional EDW
– Still evolving
– Learning curve
Use Case – Customer Analytics
• A company can often have thousands to millions
of customers (e.g. eBay, Amazon, YouTube,
LinkedIn etc.)
• Analytics at the customer level can add significant
value to both the customer and the enterprise
• Traditional EDW appliances will not be able to
support customer analytics/reporting for large
enterprises
• Big Data eco system of tools can handle customer
analytics for an enterprise of any size
Hadoop eco system
*Diagram – Hadoop eco system:
– Hadoop Core Components: Distributed File System (HDFS), Map Reduce
– Hadoop Components (Map Reduce based): Hive, Pig, Flume, Sqoop, Oozie, Mahout
– Non Map Reduce: Impala, Presto, HBase
Use Case – Customer Analytics
• Capture data from web logs and load into
Hadoop – Flume/custom solution
• Load customer profile data from traditional MDM
or EDW or other source to Hadoop –
Sqoop/Hive/HBase
• Perform ETL to compute analytics at customer
level – Hive/Pig
• Database to store the pre-computed analytics for
all customers – HBase
• Visualization – is often custom as per company's
requirements
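The "compute analytics at customer level" step above is essentially a group-by over web-log events, the kind of aggregation Hive/Pig would express over data in HDFS. The sketch below simulates it in-process; event field names are invented for the example.

```python
# Illustrative sketch of per-customer analytics: count events per
# (customer, action) pair from web-log records. The result is the kind
# of pre-computed analytic that would then be stored in HBase.

from collections import defaultdict

def per_customer_counts(events):
    """Group-by (customer, action) and count - one analytic per pair."""
    counts = defaultdict(int)
    for e in events:
        counts[(e["customer"], e["action"])] += 1
    return dict(counts)

events = [{"customer": "c1", "action": "view"},
          {"customer": "c1", "action": "view"},
          {"customer": "c2", "action": "click"}]
print(per_customer_counts(events))
# {('c1', 'view'): 2, ('c2', 'click'): 1}
```

Keyed by customer, these pre-computed results can be served with single-row reads, which is why an operational store like HBase is the natural destination.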
Jobs in Big Data
• Generalized
– Data Scientists
– Solutions Architects
– Infrastructure Architects
– And many more
• Specialized
– ETL developers/architects
– Advanced analytics developers/architects
– Data Analysts/Business Analysts
– Hadoop Admins
– NoSQL Admins/DBAs
– DevOps Engineers
– And many more
Industry reaction
• Oracle – Big Data Appliance
• IBM – BigInsights
• EMC – Pivotal HD
• ETL tools – Informatica, Syncsort etc. are
adding or re-architecting big data capabilities
Hadoop eco system
*Diagram – Hadoop eco system:
– Hadoop Core Components: Distributed File System (HDFS), Map Reduce
– Hadoop Components (Map Reduce based): Hive, Pig, Flume, Sqoop, Oozie, Mahout
– Non Map Reduce: Impala, Presto, HBase
Thank You
Hadoop eco system - Setup
Environment
• https://www.youtube.com/watch?v=p7uCyFfWL-c&index=13&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
HDFS
• Hadoop Distributed File System
– It is a distributed storage
– HDFS files are logical (physically they are stored as
blocks)
– Replication factor is used for fault tolerance.
https://www.youtube.com/watch?v=-Rc-jisdyKI&index=14&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
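The point that an HDFS file is logical while storage is physical blocks can be made concrete with a tiny splitter. The 4-byte block size below is purely illustrative; HDFS defaults are far larger (e.g. 64 or 128 MB).

```python
# Sketch of "HDFS files are logical, storage is blocks": split a byte
# string into fixed-size blocks. The file is the concatenation of its
# blocks; the name node only needs to remember which blocks make it up.

def split_into_blocks(data: bytes, block_size: int):
    """Chop data into block_size-byte pieces; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"abcdefghij", block_size=4)
print(blocks)  # [b'abcd', b'efgh', b'ij']

# With a replication factor of 3, each of these blocks would be stored
# on three different data nodes for fault tolerance.
```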
Map Reduce
• https://www.youtube.com/watch?v=IRxgew6ytq8&index=16&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
• https://www.youtube.com/watch?v=he8vt835cf8&index=17&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
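The programming model the videos cover can be summarized by the classic word-count example, simulated here in a single process: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
# Classic word count expressed as map / shuffle / reduce phases,
# simulated in-process to illustrate the Map Reduce programming model.

from collections import defaultdict

def map_phase(lines):
    """Mapper: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: aggregate each key's values (here, sum the counts)."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data big", "data tools"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 2, 'data': 2, 'tools': 1}
```

In a real job the mappers run on the nodes holding the input blocks (data locality) and the shuffle moves data across the network between the map and reduce phases.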
Map Reduce based tools
• Hive
• Pig
• Sqoop
• Oozie
• Flume
• Many more
Non Map reduce based tools
• Impala
• HBase
• Many more
Introduction to Hive
• Defines a logical structure on top of data in HDFS
• Provides commands to load/insert data into HDFS
• Provides a SQL interface to process data in HDFS
• It typically uses map reduce to process the data
• Stores metadata (the logical structure) in a traditional
RDBMS such as MySQL
• A small demo on Hive
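Hive runs SQL over files in HDFS by compiling it to map reduce; as a stand-in for the demo, the sketch below runs the same style of query with Python's built-in sqlite3 so the SQL shape is visible. The table and column names are invented, and sqlite3 is only a local substitute, not part of the Hive stack.

```python
# Stand-in for a Hive demo query: the same group-by SQL, executed with
# sqlite3 instead of Hive. In Hive, this SELECT would typically compile
# into a map reduce job over files in HDFS.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (customer TEXT, page TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("c1", "home"), ("c1", "deals"), ("c2", "home")])

rows = conn.execute(
    "SELECT customer, COUNT(*) FROM page_views "
    "GROUP BY customer ORDER BY customer").fetchall()
print(rows)  # [('c1', 2), ('c2', 1)]
```

The group-by here is the SQL spelling of the map/shuffle/reduce pattern: map emits (customer, 1), shuffle groups by customer, reduce sums the counts.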
Introduction to Pig
• Another map reduce based interface to process
data in HDFS
• Provides commands to load and read data from
HDFS
• No need for a pre-defined structure on the data
• No need for rigid schemas
• Handy for processing unstructured or semi-structured data
• A small demo on Pig
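A Pig script is a dataflow of steps like LOAD, FILTER, and GROUP over data with no rigid schema. The sketch below mirrors that style in plain Python on tab-separated lines where some records have fewer fields; the field positions and values are assumptions for the example, not Pig syntax.

```python
# Pig-style dataflow mirrored in Python: load semi-structured lines,
# filter, then group - no pre-defined schema required, and records may
# have different numbers of fields.

lines = ["c1\tview\thome", "c2\tclick", "c1\tview\tdeals"]

records = [line.split("\t") for line in lines]   # LOAD
views = [r for r in records if r[1] == "view"]   # FILTER by action
grouped = {}                                     # GROUP BY customer
for r in views:
    grouped.setdefault(r[0], []).append(r)

print({k: len(v) for k, v in grouped.items()})  # {'c1': 2}
```

Note that the second record has only two fields and is simply filtered out; tolerating ragged records like this is what makes the approach handy for semi-structured data.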
Introduction to Sqoop
• Map reduce based utility for copying data to and
from HDFS
• It can understand the HCatalog/Hive structure
• It can copy data from almost all traditional
RDBMS and EDW appliances, as well as NoSQL
data stores, to HDFS and vice versa
• A small demo on Sqoop
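Sqoop parallelizes a table copy by splitting it on a key column and giving each mapper a range. The sketch below imitates that split with an in-memory sqlite3 table; the table, key ranges, and two-mapper split are all illustrative assumptions, not Sqoop itself.

```python
# Imitation of Sqoop's parallel import: split the key space of a source
# table into ranges and let each 'mapper' copy one range.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"user{i}@example.com") for i in range(1, 7)])

def import_range(lo, hi):
    """One 'mapper': copy rows whose key falls in [lo, hi)."""
    return conn.execute(
        "SELECT id, email FROM customers WHERE id >= ? AND id < ?",
        (lo, hi)).fetchall()

# Two parallel 'mappers' splitting the key space 1..7.
part1, part2 = import_range(1, 4), import_range(4, 7)
print(len(part1), len(part2))  # 3 3
```

Because each range is an independent query, the copies can run concurrently on different nodes, which is how Sqoop turns a single table export into a map reduce job.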
