Overview of big data technologies such as Hadoop, Hive, Pig, HDFS, MapReduce and Spark, plus example architectures for designing big data products and platforms.
2. Agenda
• Big data ecosystem
• Architecture of Hadoop and Spark
• Technology details of the big data ecosystem
• Lambda architecture
• Big data architecture principles
• Use-case examples
3. Hadoop
Distributed storage and processing system
Designed for scalability
Hadoop 1.0 – HDFS and MapReduce
Hadoop 2.0 – 1.0 + YARN resource management
Latest version – 3.1.1
10. Hadoop - HDFS
Distributed file system
Name node
Data node
Forms the basis for Hadoop eco system
Optimized for high-throughput access to large files (not for low-latency access)
Blocks replicated across nodes for fault tolerance and high availability
30. Technology trend
• CPUs not fast enough
• GPUs costly and not a fit for all problems
• Elastic cloud
• Open ecosystem
• Batch computations
• Serialization protocols
• Random-access NoSQL databases
• Message queues
• Real-time computations
The term "big data" attracts many because it raises questions: What is it? How is it used? How is it beneficial?
We will discuss a broad spectrum of technologies such as Hadoop, MapReduce, Spark, Hive, Pig, HBase, AWS, Azure, S3 and the Lambda architecture, along with current technology trends, the current status of big data companies such as Hortonworks (HDP), MapR and Cloudera, big data design principles, and real-world big data solutions.
We will start with what Hadoop is, how it evolved, and what tools are available around it.
We will cover the architecture of Hadoop and Spark, then take a technology deep dive into each ecosystem component: HDFS, MapReduce, Hive, Pig, HBase, Spark and YARN.
We will summarize each component's role, how to use it, when to use it, and its pros and cons.
While doing so, we explain the problems and best practices around these technologies.
We cover the current technology trend and where the world is moving with big data.
We explain cloud technologies such as AWS and Azure and their impact on big data.
We take a deep dive into the Lambda architecture and how to use it for big data.
There are deep-dive demo sessions on big data processing using Spark.
We also explain big data architecture principles, design, and best practices of real-world big data systems built using these technologies.
Intended for beginners and technology enthusiasts who want to understand the big data ecosystem.
Hadoop is a software library for distributed processing of large datasets across clusters of computers using simple programming models.
Scale out
Distributed storage
Distributed processing
Eventual consistency
In 2010, Facebook claimed to have one of the largest HDFS cluster storing 21 Petabytes of data.
In 2012, Facebook declared that they have the largest single HDFS cluster with more than 100 PB of data.
And Yahoo! has more than 100,000 CPUs in over 40,000 servers running Hadoop, with its biggest Hadoop cluster running 4,500 nodes. In all, Yahoo! stores 455 petabytes of data in HDFS.
In fact, by 2013, most of the big names in the Fortune 50 started using Hadoop.
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications.
Hadoop MapReduce – a programming model for large scale data processing.
Apache Hive – Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.
Apache Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs paired with the MapReduce framework for processing these programs.
MapReduce – MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner.
Apache Spark – Spark is ideal for in-memory data processing. It allows data scientists to implement fast, iterative algorithms for advanced analytics such as clustering and classification of datasets.
Apache Storm – Storm is a distributed real-time computation system for processing fast, large streams of data adding reliable real-time data processing capabilities to Apache Hadoop 2.x
Apache HBase – A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
Apache Tez – Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex DAG (directed acyclic graph) of tasks for near real-time big data processing.
Apache Kafka – Kafka is a fast and scalable publish-subscribe messaging system that is often used in place of traditional message brokers because of its higher throughput, replication, and fault tolerance.
Apache HCatalog – A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
Apache Slider – A framework for deployment of long-running data access applications in Hadoop. Slider leverages YARN’s resource management capabilities to deploy those applications, to manage their lifecycles and scale them up or down.
Apache Solr – Solr is the open source platform for searches of data stored in Hadoop. Solr enables powerful full-text search and near real-time indexing on many of the world’s largest Internet sites.
Apache Mahout – Mahout provides scalable machine learning algorithms for Hadoop which aids with data science for clustering, classification and batch based collaborative filtering.
Apache Accumulo – Accumulo is a high performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Big Table design that works on top of Apache Hadoop and Apache ZooKeeper.
Data Governance and Integration – Quickly and easily load data, and manage according to policy. Workflow Manager provides workflows for data governance, while Apache Flume and Sqoop enable easy data ingestion, as do the NFS and WebHDFS interfaces to HDFS.
Workflow Management – Workflow Manager allows you to easily create and schedule workflows and monitor workflow jobs. It is based on the Apache Oozie workflow engine that allows users to connect and automate the execution of big data processing tasks into a defined workflow.
Apache Flume – Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
Apache Sqoop – Sqoop is a tool that speeds and eases movement of data in and out of Hadoop. It provides a reliable parallel load for various, popular enterprise data sources.
Security – Address requirements of Authentication, Authorization, Accounting and Data Protection. Security is provided at every layer of the Hadoop stack from HDFS and YARN to Hive and the other Data Access components on up through the entire perimeter of the cluster via Apache Knox.
Apache Knox – The Knox Gateway (“Knox”) provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access to the cluster.
Apache Ranger – Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides central security policy administration across the core enterprise security requirements of authorization, accounting and data protection.
Operations – Provision, manage, monitor and operate Hadoop clusters at scale.
Apache Ambari – An open source installation lifecycle management, administration and monitoring system for Apache Hadoop clusters.
Apache Oozie – Oozie Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
Apache ZooKeeper – A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
Functions of NameNode:
It is the master daemon that maintains and manages the DataNodes (slave nodes)
It records the metadata of all the files stored in the cluster, e.g. The location of blocks stored, the size of the files, permissions, hierarchy, etc. There are two files associated with the metadata:
FsImage: It contains the complete state of the file system namespace since the start of the NameNode.
EditLogs: It contains all the recent modifications made to the file system with respect to the most recent FsImage.
It records each change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live.
It keeps a record of all the blocks in HDFS and in which nodes these blocks are located.
The NameNode is also responsible for maintaining the replication factor of all the blocks, which we will discuss in detail later.
In case of DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
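The re-replication decision can be sketched in plain Python (a toy model, not HDFS code; the node names and the replication factor of 3 are illustrative assumptions):

```python
# Toy model of NameNode re-replication after a DataNode failure.
# Not actual HDFS code: node names and replication factor are illustrative.
REPLICATION_FACTOR = 3

def re_replicate(block_locations, live_nodes):
    """For each block, pick new live nodes until the replication factor is met."""
    for block, nodes in block_locations.items():
        # Drop replicas that lived on failed nodes.
        nodes[:] = [n for n in nodes if n in live_nodes]
        # Choose new targets from live nodes not already holding the block.
        candidates = [n for n in live_nodes if n not in nodes]
        while len(nodes) < REPLICATION_FACTOR and candidates:
            nodes.append(candidates.pop(0))
    return block_locations

blocks = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn4", "dn5"]}
live = ["dn1", "dn3", "dn4", "dn5", "dn6"]  # dn2 has failed
print(re_replicate(blocks, live))
```

After the run, each block is back to three replicas, all on live nodes; a real NameNode would additionally weigh rack placement and disk usage when choosing targets.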
Functions of DataNode:
These are slave daemons or processes that run on each slave machine.
The actual data is stored on DataNodes.
The DataNodes perform the low-level read and write requests from the file system’s clients.
They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
Blocks are nothing but the smallest contiguous locations on your hard drive where data is stored.
In general, in any of the File System, you store the data as a collection of blocks.
Similarly, HDFS stores each file as blocks which are scattered throughout the Apache Hadoop cluster.
The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache Hadoop 1.x) which you can configure as per your requirement.
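As a quick illustration of the arithmetic (plain Python; block size per the Hadoop 2.x default):

```python
# How HDFS splits a file into fixed-size blocks (illustrative arithmetic only).
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Apache Hadoop 2.x default

def num_blocks(file_size_bytes):
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 300 MB file occupies 3 blocks: 128 MB + 128 MB + 44 MB.
print(num_blocks(300 * 1024 * 1024))  # 3
```

Note the last block only occupies as much space as the remaining data, so small files do not waste a full 128 MB on disk.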
Why YARN
Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics.
YARN is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.
What YARN Does
YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.
YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost effective, linear-scale storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run IN Hadoop.
How YARN Works
YARN’s original purpose was to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:
a global ResourceManager
a per-application ApplicationMaster
a per-node slave NodeManager
a per-application Container running on a NodeManager
Map stage: The mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
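The map → shuffle → reduce flow above can be sketched as a toy word count in plain Python (not the Hadoop API; the function names and sample lines are illustrative):

```python
# Minimal word-count sketch of the map -> shuffle -> reduce flow.
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big", "data tools"]
print(reduce_phase(shuffle(map_phase(lines))))  # {'big': 2, 'data': 2, 'tools': 1}
```

In Hadoop the same three steps run in parallel across the cluster, with the shuffle moving data over the network between mapper and reducer machines.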
Apache HBase is an open source NoSQL database that provides real-time read/write access to large datasets.
HBase scales linearly to handle huge data sets with billions of rows and millions of columns, and it easily combines data sources that use a wide variety of different structures and schemas. HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
Apache HBase provides random, real time access to your data in Hadoop. It was created for hosting very large tables, making it a great choice to store multi-structured or sparse data. Users can query HBase for a particular point in time, making “flashback” queries possible.
HBase HMaster performs DDL operations (create and delete tables) and assigns regions to the Region servers as you can see in the above image.
It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
It assigns regions to the Region Servers on startup and re-assigns regions to Region Servers during recovery and load balancing.
It monitors all the Region Server’s instances in the cluster (with the help of Zookeeper) and performs recovery activities whenever any Region Server is down.
It provides an interface for creating, deleting and updating tables.
The META table is a special HBase catalog table. It maintains a list of all the Region Servers in the HBase storage system, as you can see in the above image.
Looking at the figure, you can see that the .META. file maintains the table in the form of keys and values: the key represents the start key of the region and its id, whereas the value contains the path of the Region Server.
WAL: As you can conclude from the above image, Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL stores the new data that hasn’t been persisted or committed to the permanent storage. It is used in case of failure to recover the data sets.
Block Cache: From the above image, it is clearly visible that Block Cache resides in the top of Region Server. It stores the frequently read data in the memory. If the data in BlockCache is least recently used, then that data is removed from BlockCache.
MemStore: It is the write cache. It stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region. As you can see in the image, there are multiple MemStores for a region because each region contains multiple column families. The data is sorted in lexicographical order before committing it to the disk.
HFile: From the above figure you can see that HFiles are stored on HDFS; they hold the actual cells on disk. The MemStore commits its data to a new HFile when the MemStore size exceeds a configured threshold.
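The write path just described (WAL append, MemStore buffering, flush to a sorted HFile) can be sketched as a toy model in Python (illustrative only; the names, row keys and flush threshold are assumptions, and real HBase flushes by bytes, not cell count):

```python
# Toy model of the HBase write path: WAL append, MemStore buffer,
# flush to an HFile once the MemStore crosses a size threshold.
FLUSH_THRESHOLD = 3  # flush after 3 buffered cells (real HBase uses bytes)

wal, memstore, hfiles = [], {}, []

def put(row_key, value):
    wal.append((row_key, value))          # 1. durably log the write (WAL)
    memstore[row_key] = value             # 2. buffer it in memory (MemStore)
    if len(memstore) >= FLUSH_THRESHOLD:  # 3. flush when the write cache fills
        # Cells are written out sorted by row key, as in an HFile.
        hfiles.append(sorted(memstore.items()))
        memstore.clear()

for k, v in [("row3", "c"), ("row1", "a"), ("row2", "b"), ("row4", "d")]:
    put(k, v)

print(hfiles)    # one flushed, lexicographically sorted HFile
print(memstore)  # 'row4' still buffered, recoverable from the WAL on crash
```

The WAL retains all four writes, so if the Region Server died before the second flush, the buffered `row4` cell could be replayed from the log.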
SQL Database System and Hadoop – MapReduce framework
Infrastructure on top of Hadoop
Hive defines a simple SQL-like query language, called HiveQL (HQL), for querying and managing large datasets. It is easy to use if you are familiar with SQL.
1. Apache Hive is built on top of Hadoop's distributed storage.
2. Hive provides tools to enable easy data extract/transform/load (ETL).
3. It imposes structure on a variety of data formats.
4. Using Hive, we can access files stored in the Hadoop Distributed File System (HDFS) or in other data storage systems such as Apache HBase.
• Hive is not designed for online transaction processing (OLTP); it is used only for online analytical processing (OLAP).
• Hive supports overwriting or appending data, but not updates and deletes.
• In Hive, subqueries are not supported.
Hive Clients: Hive supports applications written in many languages like Java, C++ and Python using JDBC, Thrift and ODBC drivers. Hence one can always write a Hive client application in a language of their choice.
Hive Services: Apache Hive provides various services, like the CLI and a web interface, to perform queries.
Processing framework and Resource Management: Internally, Hive uses the Hadoop MapReduce framework as the de facto engine to execute queries. The Hadoop MapReduce framework is a separate topic in itself and therefore is not discussed here.
Distributed Storage: As Hive is installed on top of Hadoop, it uses the underlying HDFS for distributed storage.
Hive determines the bucket number for a row using the formula: hash_function(bucketing_column) modulo num_of_buckets. Here, hash_function depends on the column data type. For example, if you are bucketing the table on a column of INT type, say user_id, then hash_function(user_id) = integer value of user_id. If you have created two buckets, Hive determines the rows going into each bucket within a partition by calculating (value of user_id) modulo 2. Therefore, in this case, rows whose user_id ends with an even digit will all reside in the same bucket within each partition. The hash_function for other data types is more complex to calculate; for a string, it is not even human-readable.
Note: If you are using Apache Hive 0.x or 1.x, you have to issue the command set hive.enforce.bucketing = true; from your Hive terminal before performing bucketing. This ensures the correct number of reducers when using the CLUSTER BY clause for bucketing a column. If you have not done so, you may find that the number of files generated in your table directory is not equal to the number of buckets. As an alternative, you may set the number of reducers equal to the number of buckets with set mapred.reduce.tasks = num_buckets.
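The bucket assignment described above can be sketched for an INT column in plain Python (the user_id values are made up):

```python
# Hive's bucket assignment for an INT column, per the formula above:
# bucket = hash_function(bucketing_column) % num_of_buckets.
# For INT columns the hash is the integer value itself.

def bucket_for(user_id, num_buckets):
    """Which bucket a row with this INT user_id lands in."""
    return user_id % num_buckets

num_buckets = 2
for user_id in [101, 102, 103, 104]:
    print(user_id, "-> bucket", bucket_for(user_id, num_buckets))
# With two buckets, even user_ids land in bucket 0 and odd ones in bucket 1.
```

This is why, with two buckets, all rows whose user_id ends in an even digit end up together: the modulo of an even integer by 2 is always 0.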
Advantages of bucketing
A map-side join requires the data belonging to a unique join key to be present in the same partition. But what about cases where your partition key differs from the join key? In these cases, you can still perform a map-side join by bucketing the table on the join key.
Bucketing makes the sampling process more efficient and therefore, allows us to decrease the query time.
We have data for three departments in our student_details table – CSE, ECE and Civil. Therefore, we will have three partitions in total, one for each department, as shown in the image below. For each department, all the data regarding that department resides in a separate sub-directory under the Hive table directory. For example, all the student data for the CSE department will be stored in user/hive/warehouse/student_details/dept=CSE. Queries regarding CSE students would then only have to look through the data present in the CSE partition. This makes partitioning very useful, as it reduces query latency by scanning only the relevant partitioned data instead of the whole dataset. In real-world implementations you will be dealing with hundreds of TBs of data, so imagine scanning that huge amount of data for a query where 95% of what you scanned was irrelevant.
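Partition pruning can be sketched like this (plain Python; the directory layout mirrors the example above, and the student ids are made up):

```python
# Partition pruning sketch: a query filtered on dept only reads files
# under that partition's sub-directory (paths follow the example above).
partitions = {
    "user/hive/warehouse/student_details/dept=CSE":   ["s1", "s2"],
    "user/hive/warehouse/student_details/dept=ECE":   ["s3"],
    "user/hive/warehouse/student_details/dept=Civil": ["s4", "s5"],
}

def scan(dept):
    # Only the matching partition directory is scanned;
    # the other partitions are never touched.
    path = f"user/hive/warehouse/student_details/dept={dept}"
    return partitions[path]

print(scan("CSE"))  # ['s1', 's2'] -- ECE and Civil partitions untouched
```

A full-table scan would read all five records; the pruned query reads only the two in the CSE sub-directory, which is exactly the latency saving partitioning buys.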
Pig Latin is a high-level data flow language, whereas MapReduce is a low-level data processing paradigm.
Without writing complex Java implementations in MapReduce, programmers can achieve the same implementations very easily using Pig Latin.
Apache Pig uses a multi-query approach (i.e., a single Pig Latin query can accomplish multiple MapReduce tasks), which reduces the length of the code by up to 20 times. Hence, this reduces the development period by almost 16 times.
Pig provides many built-in operators to support data operations like joins, filters, ordering and sorting, whereas performing the same functions in MapReduce is a humongous task.
Performing a Join operation in Apache Pig is simple. Whereas it is difficult in MapReduce to perform a Join operation between the data sets, as it requires multiple MapReduce tasks to be executed sequentially to fulfill the job.
In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.
Flume components interact in the following way:
A flow in Flume starts from the Client.
The Client transmits the Event to a Source operating within the Agent.
The Source receiving this Event then delivers it to one or more Channels.
One or more Sinks operating within the same Agent drain these Channels.
Channels decouple the ingestion rate from drain rate using the familiar producer-consumer model of data exchange.
When spikes in client-side activity cause data to be generated faster than the provisioned destination capacity can handle, the Channel size increases. This allows Sources to continue normal operation for the duration of the spike.
The Sink of one Agent can be chained to the Source of another Agent. This chaining enables the creation of complex data flow topologies.
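The Channel's buffering behavior is the classic producer-consumer pattern and can be sketched with a simple queue (plain Python; the event names and counts are made up, and `queue.Queue` merely stands in for a Flume Channel):

```python
# Producer-consumer sketch of a Flume Channel: the Source enqueues
# events faster than the Sink drains them, so the Channel absorbs
# the spike instead of dropping data.
import queue

channel = queue.Queue()

# Source side: a burst of 5 events arrives from the Client.
for i in range(5):
    channel.put(f"event-{i}")

# Sink side: drains only 2 events this cycle (limited destination capacity).
drained = [channel.get() for _ in range(2)]

print(drained)          # ['event-0', 'event-1']
print(channel.qsize())  # 3 events still buffered in the Channel
```

The three buffered events are delivered on later drain cycles once the spike subsides, which is exactly how the Channel decouples ingestion rate from drain rate.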
https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured datastores such as relational databases. Sqoop helps offload certain tasks (such as ETL processing) from the EDW to Hadoop for efficient execution at a much lower cost. Sqoop can also be used to extract data from Hadoop and export it into external structured datastores. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB
Flume only ingests unstructured or semi-structured data into HDFS, while Sqoop can both import and export structured data between RDBMSs or enterprise data warehouses and HDFS.
Oozie Workflow Jobs – Directed Acyclic Graphs (DAGs) that specify a sequence of actions to be executed.
Oozie Coordinator Jobs – Workflow jobs triggered by time and data availability.
Oozie Bundles – Packages of multiple coordinator and workflow jobs.
Data warehouse vs. data lake
• Schema: schema-on-write vs. schema-on-read
• Scale: scales to moderate-to-large volumes at moderate cost vs. scales to huge volumes at low cost
• Access methods: standardized SQL and BI tools vs. SQL-like systems, programs created by developers, and big data analytics tools
• Workload: batch processing plus thousands of concurrent users performing interactive analytics vs. batch and stream processing, with improved capability over data warehouses to support big data inquiries from users
• Data: cleansed vs. raw and refined
• Data complexity: complex integrations vs. complex processing
• Cost/efficiency: efficiently uses CPU/IO but high storage and processing costs vs. efficiently uses storage and processing capabilities at very low cost
Data warehouse benefits:
• Transform once, use many
• Easy to consume data
• Fast response times
• Mature governance
• Provides a single enterprise-wide view of data from multiple sources
• Clean, safe, secure data
• High concurrency
• Operational integration
Data warehouse drawbacks:
• Time consuming
• Expensive
• Difficult to conduct ad hoc and exploratory analytics
• Only structured data
Data lake benefits:
• Transforms the economics of storing large amounts of data
• Easy to consume data
• Fast response times
• Mature governance
• Provides a single enterprise-wide view of data
• Scales to execute on tens of thousands of servers
• Allows use of any tool
• Enables analysis to begin as soon as data arrives
• Allows usage of structured and unstructured content from a single source
• Supports Agile modeling by allowing users to change models, applications and queries
• Analytics and big data analytics
Data lake drawbacks:
• Complexity of the big data ecosystem
• Lack of visibility if not managed and organized
• Big data skills gap
The above design depicts the application of machine learning models using big data infrastructure. Each module is connected via either REST APIs or RabbitMQ messaging: RMQ is used to orchestrate the data processing pipelines, with S3 as intermediate storage, while REST APIs handle model management. Models can be created using SparkML, PySpark, R, Python or H2O, so the design is flexible enough to adopt machine learning from a variety of data science groups and projects. The complete design is based on open source, with no lock-in to any vendor.
The REST server is built using Spray/Akka, which aligns the system with the Hadoop platform and scales to millions of model and score requests.