Overview of big data technologies such as Hadoop, Hive, Pig, HDFS, MapReduce and Spark, plus example architectures for designing big data products and platforms.
2. Agenda
• Big data ecosystem
• Architecture of Hadoop and Spark
• Technology details of the big data ecosystem
• Lambda architecture
• Big data architecture principles
• Use-case examples
3. Hadoop
Distributed storage and processing system
Designed for scalability
Hadoop 1.0 – HDFS and MapReduce
Hadoop 2.0 – 1.0 + YARN resource management
Latest version – 3.1.1
10. Hadoop - HDFS
Distributed file system
Name node
Data node
Forms the basis for Hadoop eco system
Optimized for high-throughput access to large files (not for low-latency access)
Blocks replicated across nodes for fault tolerance and high availability
30. Technology trend
• CPUs not fast enough
• GPUs costly and not a fit for all problems
• Elastic cloud
• Open ecosystem
• Batch computations
• Serialization protocols
• Random-access NoSQL databases
• Message queues
• Real-time computations
The term "big data" attracts many because it raises questions: What is it? How is it used? How is it beneficial?
We will discuss a broad spectrum of technologies such as Hadoop, MapReduce, Spark, Hive, Pig, HBase, AWS, Azure, S3 and the Lambda architecture, along with current technology trends, the current status of big data companies such as Hortonworks (HDP), MapR and Cloudera, big data design principles, and real-world big data solutions.
We will start with what Hadoop is, how it evolved, and what tools are available around it.
We will cover the architecture of Hadoop and Spark, then take a technology deep dive into each ecosystem component: HDFS, MapReduce, Hive, Pig, HBase, Spark and YARN.
We will summarize each component's role, how to use it, when to use it, and its pros and cons.
While doing so, we explain the problems and best practices around these technologies.
We cover the current technology trend and where the world is moving with big data.
We explain cloud technologies such as AWS and Azure and their impact on big data.
We take a deep dive into the Lambda architecture and how to use it for big data.
There are deep-dive demo sessions on big data processing using Spark.
We also explain big data architecture principles, design, and best practices of real-world big data systems built using these technologies.
Intended for beginners and technology enthusiasts who want to understand the big data ecosystem.
Hadoop is a software library for distributed processing of large datasets across clusters of computers using simple programming models.
Scale out
Distributed storage
Distributed processing
Eventual consistency
In 2010, Facebook claimed to have one of the largest HDFS cluster storing 21 Petabytes of data.
In 2012, Facebook declared that they have the largest single HDFS cluster with more than 100 PB of data.
And Yahoo! has more than 100,000 CPUs in over 40,000 servers running Hadoop, with its biggest Hadoop cluster running 4,500 nodes. In all, Yahoo! stores 455 petabytes of data in HDFS.
In fact, by 2013, most of the big names in the Fortune 50 started using Hadoop.
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications.
Hadoop MapReduce – a programming model for large scale data processing.
Apache Hive – Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.
Apache Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs paired with the MapReduce framework for processing these programs.
MapReduce – MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.
Apache Spark – Spark is ideal for in-memory data processing. It allows data scientists to implement fast, iterative algorithms for advanced analytics such as clustering and classification of datasets.
Apache Storm – Storm is a distributed real-time computation system for processing fast, large streams of data adding reliable real-time data processing capabilities to Apache Hadoop 2.x
Apache HBase – A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
Apache Tez – Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
Apache Kafka – Kafka is a fast and scalable publish-subscribe messaging system that is often used in place of traditional message brokers because of its higher throughput, replication, and fault tolerance.
Apache HCatalog – A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
Apache Slider – A framework for deployment of long-running data access applications in Hadoop. Slider leverages YARN’s resource management capabilities to deploy those applications, to manage their lifecycles and scale them up or down.
Apache Solr – Solr is the open source platform for searches of data stored in Hadoop. Solr enables powerful full-text search and near real-time indexing on many of the world’s largest Internet sites.
Apache Mahout – Mahout provides scalable machine learning algorithms for Hadoop which aids with data science for clustering, classification and batch based collaborative filtering.
Apache Accumulo – Accumulo is a high performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Big Table design that works on top of Apache Hadoop and Apache ZooKeeper.
Data Governance and Integration – Quickly and easily load data, and manage according to policy. Workflow Manager provides workflows for data governance, while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS.
Workflow Management – Workflow Manager allows you to easily create and schedule workflows and monitor workflow jobs. It is based on the Apache Oozie workflow engine that allows users to connect and automate the execution of big data processing tasks into a defined workflow.
Apache Flume – Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
Apache Sqoop – Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable parallel load for various, popular enterprise data sources.
Security – Address requirements of Authentication, Authorization, Accounting and Data Protection. Security is provided at every layer of the Hadoop stack from HDFS and YARN to Hive and the other Data Access components on up through the entire perimeter of the cluster via Apache Knox.
Apache Knox – The Knox Gateway (“Knox”) provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access to the cluster.
Apache Ranger – Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, accounting and data protection.
Operations – Provision, manage, monitor and operate Hadoop clusters at scale.
Apache Ambari – An open source installation lifecycle management, administration and monitoring system for Apache Hadoop clusters.
Apache Oozie – Oozie Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
Apache ZooKeeper – A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
Functions of NameNode:
It is the master daemon that maintains and manages the DataNodes (slave nodes)
It records the metadata of all the files stored in the cluster, e.g. The location of blocks stored, the size of the files, permissions, hierarchy, etc. There are two files associated with the metadata:
FsImage: It contains the complete state of the file system namespace since the start of the NameNode.
EditLogs: It contains all the recent modifications made to the file system with respect to the most recent FsImage.
It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live.
It keeps a record of all the blocks in HDFS and in which nodes these blocks are located.
The NameNode is also responsible for maintaining the replication factor of all the blocks, which we will discuss in detail later.
In case of DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
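The re-replication decision can be sketched in plain Python (a toy model, not HDFS code; the node names and the replication factor of 3 are illustrative assumptions):

```python
# Toy model of NameNode re-replication after a DataNode failure.
# Not actual HDFS code: node names and replication factor are illustrative.
REPLICATION_FACTOR = 3

def re_replicate(block_locations, live_nodes):
    """For each block, pick new live nodes until the replication factor is met."""
    for block, nodes in block_locations.items():
        # Drop replicas that lived on failed nodes.
        nodes[:] = [n for n in nodes if n in live_nodes]
        # Choose new targets from live nodes not already holding the block.
        candidates = [n for n in live_nodes if n not in nodes]
        while len(nodes) < REPLICATION_FACTOR and candidates:
            nodes.append(candidates.pop(0))
    return block_locations

blocks = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn4", "dn5"]}
live = ["dn1", "dn3", "dn4", "dn5", "dn6"]  # dn2 has failed
print(re_replicate(blocks, live))
```

After the run, each block is back to three replicas, all on live nodes; a real NameNode would additionally weigh rack placement and disk usage when choosing targets.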
Functions of DataNode:
These are slave daemons or processes that run on each slave machine.
The actual data is stored on DataNodes.
The DataNodes perform the low-level read and write requests from the file system’s clients.
They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
Blocks are nothing but the smallest contiguous locations on your hard drive where data is stored.
In general, in any of the File System, you store the data as a collection of blocks.
Similarly, HDFS stores each file as blocks which are scattered throughout the Apache Hadoop cluster.
The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x) which you can configure as per your requirement.
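As a quick illustration of the arithmetic (plain Python; block size per the Hadoop 2.x default):

```python
# How HDFS splits a file into fixed-size blocks (illustrative arithmetic only).
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Apache Hadoop 2.x default

def num_blocks(file_size_bytes):
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 300 MB file occupies 3 blocks: 128 MB + 128 MB + 44 MB.
print(num_blocks(300 * 1024 * 1024))  # 3
```

Note the last block only occupies as much space as the remaining data, so small files do not waste a full 128 MB on disk.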
Why YARN
Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics.
YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
What YARN Does
YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.
YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost effective, linear-scale storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run IN Hadoop.
How YARN Works
YARN’s original purpose was to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:
a global ResourceManager
a per-application ApplicationMaster
a per-node slave NodeManager
a per-application Container running on a NodeManager
Map stage: The mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
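The map → shuffle → reduce flow above can be sketched as a toy word count in plain Python (not the Hadoop API; the function names and sample lines are illustrative):

```python
# Minimal word-count sketch of the map -> shuffle -> reduce flow.
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big", "data tools"]
print(reduce_phase(shuffle(map_phase(lines))))  # {'big': 2, 'data': 2, 'tools': 1}
```

In Hadoop the same three steps run in parallel across the cluster, with the shuffle moving data over the network between mapper and reducer machines.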
Apache HBase is an open source NoSQL database that provides real-time read/write access to large datasets.
HBase scales linearly to handle huge data sets with billions of rows and millions of columns, and it easily combines data sources that use a wide variety of different structures and schemas. HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
Apache HBase provides random, real time access to your data in Hadoop. It was created for hosting very large tables, making it a great choice to store multi-structured or sparse data. Users can query HBase for a particular point in time, making “flashback” queries possible.
HBase HMaster performs DDL operations (create and delete tables) and assigns regions to the Region servers as you can see in the above image.
It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
It assigns regions to the Region Servers on startup and re-assigns regions to Region Servers during recovery and load balancing.
It monitors all the Region Server’s instances in the cluster (with the help of Zookeeper) and performs recovery activities whenever any Region Server is down.
It provides an interface for creating, deleting and updating tables.
The META table is a special HBase catalog table. It maintains a list of all the Region Servers in the HBase storage system, as you can see in the above image.
Looking at the figure, you can see that the .META. file maintains the table in the form of keys and values: the key represents the start key of the region and its id, whereas the value contains the path of the Region Server.
WAL: As you can conclude from the above image, Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage. It is used in case of failure to recover the data sets.
Block Cache: From the above image, it is clearly visible that Block Cache resides in the top of Region Server. It stores the frequently read data in the memory. If the data in BlockCache is least recently used, then that data is removed from BlockCache.
MemStore: It is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region. As you can see in the image, there are multiple MemStores for a region because each region contains multiple column families. The data is sorted in lexicographical order before committing it to the disk.
HFile: From the above figure you can see that HFiles are stored on HDFS; they hold the actual cells on disk. The MemStore commits its data to a new HFile when the MemStore size exceeds a configured threshold.
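The write path just described (WAL append, MemStore buffering, flush to a sorted HFile) can be sketched as a toy model in Python (illustrative only; the names, row keys and flush threshold are assumptions, and real HBase flushes by bytes, not cell count):

```python
# Toy model of the HBase write path: WAL append, MemStore buffer,
# flush to an HFile once the MemStore crosses a size threshold.
FLUSH_THRESHOLD = 3  # flush after 3 buffered cells (real HBase uses bytes)

wal, memstore, hfiles = [], {}, []

def put(row_key, value):
    wal.append((row_key, value))          # 1. durably log the write (WAL)
    memstore[row_key] = value             # 2. buffer it in memory (MemStore)
    if len(memstore) >= FLUSH_THRESHOLD:  # 3. flush when the write cache fills
        # Cells are written out sorted by row key, as in an HFile.
        hfiles.append(sorted(memstore.items()))
        memstore.clear()

for k, v in [("row3", "c"), ("row1", "a"), ("row2", "b"), ("row4", "d")]:
    put(k, v)

print(hfiles)    # one flushed, lexicographically sorted HFile
print(memstore)  # 'row4' still buffered, recoverable from the WAL on crash
```

The WAL retains all four writes, so if the Region Server died before the second flush, the buffered `row4` cell could be replayed from the log.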
SQL Database System and Hadoop – MapReduce framework
Infrastructure on top of Hadoop
Hive defines a simple SQL-like query language, called HiveQL (HQL), for querying and managing large datasets. It is easy to use if you are familiar with SQL.
1. Apache Hive is built on top of Hadoop's distributed storage.
2. Hive provides tools to enable easy data extract/transform/load (ETL).
3. It imposes structure on a variety of data formats.
4. Using Hive, we can access files stored in the Hadoop Distributed File System (HDFS) or in other data storage systems such as Apache HBase.
• Hive is not designed for online transaction processing (OLTP); it is used only for online analytical processing (OLAP).
• Hive supports overwriting or appending data, but not updates and deletes.
• In Hive, subqueries are not supported.
Hive Clients: Hive supports applications written in many languages like Java, C++ and Python using JDBC, Thrift and ODBC drivers. Hence one can always write a Hive client application in a language of their choice.
Hive Services: Apache Hive provides various services, like the CLI and a web interface, to perform queries.
Processing framework and Resource Management: Internally, Hive uses the Hadoop MapReduce framework as the de facto engine to execute queries. The Hadoop MapReduce framework is a separate topic in itself and therefore is not discussed here.
Distributed Storage: As Hive is installed on top of Hadoop, it uses the underlying HDFS for distributed storage.
Hive determines the bucket number for a row using the formula: hash_function(bucketing_column) modulo num_of_buckets. Here, hash_function depends on the column data type. For example, if you are bucketing the table on a column of INT type, say user_id, then hash_function(user_id) = integer value of user_id. If you have created two buckets, Hive determines the rows going into each bucket within a partition by calculating (value of user_id) modulo 2. Therefore, in this case, rows whose user_id ends with an even digit will all reside in the same bucket within each partition. The hash_function for other data types is more complex to calculate; for a string, it is not even human-readable.
Note: If you are using Apache Hive 0.x or 1.x, you have to issue the command set hive.enforce.bucketing = true; from your Hive terminal before performing bucketing. This ensures the correct number of reducers when using the CLUSTER BY clause for bucketing a column. If you have not done so, you may find that the number of files generated in your table directory is not equal to the number of buckets. As an alternative, you may set the number of reducers equal to the number of buckets with set mapred.reduce.tasks = num_buckets.
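The bucket assignment described above can be sketched for an INT column in plain Python (the user_id values are made up):

```python
# Hive's bucket assignment for an INT column, per the formula above:
# bucket = hash_function(bucketing_column) % num_of_buckets.
# For INT columns the hash is the integer value itself.

def bucket_for(user_id, num_buckets):
    """Which bucket a row with this INT user_id lands in."""
    return user_id % num_buckets

num_buckets = 2
for user_id in [101, 102, 103, 104]:
    print(user_id, "-> bucket", bucket_for(user_id, num_buckets))
# With two buckets, even user_ids land in bucket 0 and odd ones in bucket 1.
```

This is why, with two buckets, all rows whose user_id ends in an even digit end up together: the modulo of an even integer by 2 is always 0.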
Advantages of bucketing
A map-side join requires the data belonging to a unique join key to be present in the same partition. But what about cases where your partition key differs from the join key? In these cases, you can still perform a map-side join by bucketing the table on the join key.
Bucketing makes the sampling process more efficient and therefore, allows us to decrease the query time.
We have data for three departments in our student_details table – CSE, ECE and Civil. Therefore, we will have three partitions in total, one for each department, as shown in the image below. For each department, all the data regarding that department resides in a separate sub-directory under the Hive table directory. For example, all the student data for the CSE department will be stored in user/hive/warehouse/student_details/dept=CSE. Queries regarding CSE students would then only have to look through the data present in the CSE partition. This makes partitioning very useful, as it reduces query latency by scanning only the relevant partitioned data instead of the whole dataset. In real-world implementations you will be dealing with hundreds of TBs of data, so imagine scanning that huge amount of data for a query where 95% of what you scanned was irrelevant.
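Partition pruning can be sketched like this (plain Python; the directory layout mirrors the example above, and the student ids are made up):

```python
# Partition pruning sketch: a query filtered on dept only reads files
# under that partition's sub-directory (paths follow the example above).
partitions = {
    "user/hive/warehouse/student_details/dept=CSE":   ["s1", "s2"],
    "user/hive/warehouse/student_details/dept=ECE":   ["s3"],
    "user/hive/warehouse/student_details/dept=Civil": ["s4", "s5"],
}

def scan(dept):
    # Only the matching partition directory is scanned;
    # the other partitions are never touched.
    path = f"user/hive/warehouse/student_details/dept={dept}"
    return partitions[path]

print(scan("CSE"))  # ['s1', 's2'] -- ECE and Civil partitions untouched
```

A full-table scan would read all five records; the pruned query reads only the two in the CSE sub-directory, which is exactly the latency saving partitioning buys.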
Pig Latin is a high-level data flow language, whereas MapReduce is a low-level data processing paradigm.
Without writing complex Java implementations in MapReduce, programmers can achieve the same implementations very easily using Pig Latin.
Apache Pig uses a multi-query approach (i.e., a single Pig Latin query can accomplish multiple MapReduce tasks), which reduces the length of the code by up to 20 times. Hence, this reduces the development period by almost 16 times.
Pig provides many built-in operators to support data operations like joins, filters, ordering and sorting, whereas performing the same functions in MapReduce is a humongous task.
Performing a Join operation in Apache Pig is simple. Whereas it is difficult in MapReduce to perform a Join operation between the data sets, as it requires multiple MapReduce tasks to be executed sequentially to fulfill the job.
In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.
Flume components interact in the following way:
A flow in Flume starts from the Client.
The Client transmits the Event to a Source operating within the Agent.
The Source receiving this Event then delivers it to one or more Channels.
One or more Sinks operating within the same Agent drain these Channels.
Channels decouple the ingestion rate from drain rate using the familiar producer-consumer model of data exchange.
When spikes in client-side activity cause data to be generated faster than the provisioned destination capacity can handle, the Channel size increases. This allows Sources to continue normal operation for the duration of the spike.
The Sink of one Agent can be chained to the Source of another Agent. This chaining enables the creation of complex data flow topologies.
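The Channel's buffering behavior is the classic producer-consumer pattern and can be sketched with a simple queue (plain Python; the event names and counts are made up, and `queue.Queue` merely stands in for a Flume Channel):

```python
# Producer-consumer sketch of a Flume Channel: the Source enqueues
# events faster than the Sink drains them, so the Channel absorbs
# the spike instead of dropping data.
import queue

channel = queue.Queue()

# Source side: a burst of 5 events arrives from the Client.
for i in range(5):
    channel.put(f"event-{i}")

# Sink side: drains only 2 events this cycle (limited destination capacity).
drained = [channel.get() for _ in range(2)]

print(drained)          # ['event-0', 'event-1']
print(channel.qsize())  # 3 events still buffered in the Channel
```

The three buffered events are delivered on later drain cycles once the spike subsides, which is exactly how the Channel decouples ingestion rate from drain rate.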
https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB
Flume only ingests unstructured or semi-structured data into HDFS, while Sqoop can both import and export structured data between RDBMSs or enterprise data warehouses and HDFS.
Oozie Workflow Jobs – Directed Acyclic Graphs (DAGs) that specify a sequence of actions to be executed.
Oozie Coordinator Jobs – Workflow jobs triggered by time and data availability.
Oozie Bundles – Packages of multiple coordinator and workflow jobs.
Data warehouse vs. data lake
• Schema: schema-on-write vs. schema-on-read
• Scale: scales to moderate-to-large volumes at moderate cost vs. scales to huge volumes at low cost
• Access methods: standardized SQL and BI tools vs. SQL-like systems, programs created by developers, and big data analytics tools
• Workload: batch processing plus thousands of concurrent users performing interactive analytics vs. batch and stream processing, with improved capability over data warehouses to support big data inquiries from users
• Data: cleansed vs. raw and refined
• Data complexity: complex integrations vs. complex processing
• Cost/efficiency: efficiently uses CPU/IO but high storage and processing costs vs. efficiently uses storage and processing capabilities at very low cost
Data warehouse benefits:
• Transform once, use many
• Easy to consume data
• Fast response times
• Mature governance
• Provides a single enterprise-wide view of data from multiple sources
• Clean, safe, secure data
• High concurrency
• Operational integration
Data warehouse drawbacks:
• Time consuming
• Expensive
• Difficult to conduct ad hoc and exploratory analytics
• Only structured data
Data lake benefits:
• Transforms the economics of storing large amounts of data
• Easy to consume data
• Fast response times
• Mature governance
• Provides a single enterprise-wide view of data
• Scales to execute on tens of thousands of servers
• Allows use of any tool
• Enables analysis to begin as soon as data arrives
• Allows usage of structured and unstructured content from a single source
• Supports Agile modeling by allowing users to change models, applications and queries
• Analytics and big data analytics
Data lake drawbacks:
• Complexity of the big data ecosystem
• Lack of visibility if not managed and organized
• Big data skills gap
The above design depicts the application of machine learning models using big data infrastructure. Each module is connected via either REST APIs or RabbitMQ messaging: RMQ is used to orchestrate the data processing pipelines, with S3 as intermediate storage, while REST APIs handle model management. Models can be created using SparkML, PySpark, R, Python or H2O, so the design is flexible enough to adopt machine learning from a variety of data science groups and projects. The complete design is based on open source, with no lock-in to any vendor.
The REST server is built using Spray/Akka, which aligns the system with the Hadoop platform and scales to millions of model and score requests.