This document discusses Hadoop architecture approaches for big data, specifically data lake architecture and Lambda architecture. It provides an overview of these architectures, including their core components and how they handle batch and real-time processing. A data lake architecture uses Hadoop for flexible storage of all data, while a Lambda architecture combines batch and real-time processing to provide views of both old and new data. The document also covers classifying big data by characteristics like processing type, data sources, and format to determine the appropriate architecture.
Difference between data warehouse and data mining (maxonlinetr)
The document discusses data warehousing, online analytical processing (OLAP), and data mining. It describes a data warehouse as a subject-oriented collection of integrated data used to support management decision making. The typical architecture involves extracting, transforming, and loading data from operational systems into a data warehouse for analysis. Dimensional data modeling, including star schemas, is used to design data warehouses to enable efficient ad-hoc querying. OLAP and data mining tools are then used to analyze the data for patterns and insights.
This document provides an overview of data warehousing and data mining. It begins by defining a data warehouse as a system that contains historical and cumulative data from single or multiple sources for simplifying reporting, analysis, and decision making. It describes three common data warehouse architectures and the key components of a data warehouse, including the database, ETL tools, metadata, query tools, and data marts. The document then defines data mining as extracting usable data from raw data using software to analyze patterns. It outlines descriptive and predictive data mining tasks and techniques like clustering, associations, summarization, prediction, and classification. Finally, it provides examples of data mining applications and discusses how AWS services like Amazon Redshift can provide scalable data warehousing
The document discusses key concepts related to data warehousing including:
1) What data warehousing is, its main components, and differences from OLTP systems.
2) The typical architecture of a data warehouse including operational data sources, storage, and end-user access tools.
3) Important considerations like data flows, integration, management of metadata, and tools/technologies used.
4) Additional topics such as benefits, challenges, administration, and data marts.
This PPT contains definitions of data warehouse, data, and warehouse; data modeling; data warehouse architecture and its types; and data warehouse tiers: single-tier, two-tier, and three-tier.
This document outlines the objectives and units of study for a course on data warehousing and mining. The 5 units cover: 1) data warehousing components and architecture; 2) business analysis tools; 3) data mining tasks and techniques; 4) association rule mining and classification; and 5) clustering applications and trends in data mining. Key topics include extracting, transforming, and loading data into a data warehouse; using metadata and query/reporting tools; building dependent data marts; and applying data mining techniques like classification, clustering, and association rule mining. The course aims to introduce these concepts and their real-world implications.
This document discusses big data and Hadoop. It defines big data as high volume data that cannot be easily stored or analyzed with traditional methods. Hadoop is an open-source software framework that can store and process large data sets across clusters of commodity hardware. It has two main components - HDFS for storage and MapReduce for distributed processing. HDFS stores data across clusters and replicates it for fault tolerance, while MapReduce allows data to be mapped and reduced for analysis.
What is a Data Warehouse? OLTP vs. OLAP, Conceptual Modeling of Data Warehouses, Data Warehousing Components, Building a Data Warehouse, Mapping the Data Warehouse to a Multiprocessor Architecture, Database Architectures for Parallel Processing
This document provides an overview of data mining and data warehousing. It discusses the history and evolution of databases from the 1960s to today. Data mining is defined as using automated tools to extract hidden patterns from large databases to address the problem of data explosion. Descriptive and predictive models are used in data mining. Data warehousing involves integrating data from multiple sources into a centralized database to support analysis and decision making.
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak... (IRJET Journal)
This document discusses using machine learning algorithms and Hadoop to predict stock market performance based on historical stock data. It proposes a model that would collect stock data from various sources, preprocess the data to clean it, cluster the data using MapReduce, and then use a support vector machine algorithm to analyze the clustered data and generate stock predictions. The model is designed to take advantage of Hadoop's ability to process large datasets in parallel across multiple servers or clusters. The goal is to more accurately predict stock prices and identify market trends based on analyzing huge amounts of historical stock market data.
Iaetsd mapreduce streaming over cassandra datasets (Iaetsd Iaetsd)
This document discusses processing large datasets from Denmark's traffic using Apache Cassandra and MapReduce. It begins with an introduction to big data and how the volume, velocity, and variety of data requires alternative processing methods. Apache Cassandra is introduced as a distributed and scalable NoSQL database for storing large amounts of structured and unstructured data across servers. The document then discusses Cassandra's data model and system architecture. It describes how MapReduce can be used for distributed processing of datasets stored in Cassandra. The paper aims to process traffic datasets from Denmark using Cassandra and MapReduce to help the transportation department monitor traffic.
What is Data Mining? Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data.
Data mining involves analyzing large amounts of data to discover patterns that can be used for purposes such as increasing sales, reducing costs, or detecting fraud. It allows companies to better understand customer behavior and develop more effective marketing strategies. Common data mining techniques used by retailers include loyalty programs to track purchasing patterns and target customers with personalized coupons. Data mining software uses techniques like classification, clustering, and prediction to analyze data from different perspectives and extract useful information and patterns.
Application of Data Warehousing & Data Mining to Exploitation for Supporting ... (Gihan Wikramanayake)
M G N A S Fernando, G N Wikramanayake (2004). "Application of Data Warehousing and Data Mining to Exploitation for Supporting the Planning of Higher Education System in Sri Lanka". In: 23rd National Information Technology Conference, pp. 114-120. Computer Society of Sri Lanka (CSSL), Colombo, Sri Lanka, Jul 8-9. ISBN: 955-9155-12-1.
The document defines and describes key concepts related to data warehousing. It provides definitions of data warehousing, data warehouse features including being subject-oriented, integrated, and time-variant. It discusses why data warehousing is needed, using scenarios of companies wanting consolidated sales reports. The 3-tier architecture of extraction/transformation, data warehouse storage, and retrieval is covered. Data marts are defined as subsets of the data warehouse. Finally, the document contrasts databases with data warehouses and describes OLAP operations.
This white paper will present the opportunities laid down by the data lake and advanced analytics, as well as the challenges in integrating, mining and analyzing the data collected from these sources. It goes over the important characteristics of the data lake architecture and the Data and Analytics as a Service (DAaaS) model. It also delves into the features of a successful data lake and its optimal design, and goes over the data, applications, and analytics that are strung together to speed up the insight-brewing process for industry improvements with the help of a powerful architecture for mining and analyzing unstructured data: the data lake.
The document discusses data warehousing concepts including:
1) A data warehouse is a subject-oriented, integrated, and non-volatile collection of data used for decision making. It stores historical and current data from multiple sources.
2) The architecture of a data warehouse is typically three-tiered, with an operational data tier, data warehouse/data mart tier for storage, and client access tier. OLAP servers allow analysis of stored data.
3) ROLAP and MOLAP refer to relational and multidimensional approaches for OLAP. ROLAP dynamically generates data cubes from relational databases, while MOLAP pre-calculates and stores aggregated data in multidimensional structures.
A data warehouse consists of several key components:
- Current detail data from operational systems of record which is stored for analysis.
- Integration and transformation programs that convert operational data into a common format for the data warehouse.
- Summarized and archived data used for reporting and analysis over time.
- Metadata that describes the structure and meaning of the data.
Data warehouses are used for standard reporting, queries on summarized data, and data mining of patterns in large datasets to gain business insights.
Data warehousing combines data from multiple sources into a single database to provide businesses with analytics results from data mining, OLAP, scorecarding and reporting. It extracts, transforms and loads data from operational data stores and data marts into a data warehouse and staging area to integrate and store large amounts of corporate data. Data mining analyzes large databases to extract previously unknown and potentially useful patterns and relationships to improve business processes.
The document defines data mining as extracting useful information from large datasets. It discusses two main types of data mining tasks: descriptive tasks like frequent pattern mining and classification/prediction tasks like decision trees. Several data mining techniques are covered, including association, classification, clustering, prediction, sequential patterns, and decision trees. Real-world applications of data mining are also outlined, such as market basket analysis, fraud detection, healthcare, education, and CRM.
Big data analytics tools from vendors like IBM, Tableau, and SAS can help organizations process and analyze big data. For smaller organizations, Excel is often used, while larger organizations employ data mining, predictive analytics, and dashboards. Business intelligence applications include OLAP, data mining, and decision support systems. Big data comes from many sources like web logs, sensors, social networks, and scientific research. It is defined by the volume, variety, velocity, veracity, variability, and value of the data. Hadoop and MapReduce are common technologies for storing and analyzing big data across clusters of machines. Stream analytics is useful for real-time analysis of data like sensor data.
The document provides an overview of the key components and considerations for building a data warehouse. It discusses 7 main components: 1) the data warehouse database, 2) sourcing, acquisition, cleanup and transformation tools, 3) metadata, 4) access (query) tools, 5) data marts, 6) data warehouse administration and management, and 7) information delivery systems. It also outlines important design considerations, technical considerations, and implementation considerations that must be addressed when building a data warehouse environment.
A computer database is a collection of logically related data stored in a computer system, so that a computer program or a person using a query language can use it to answer queries. An operational database (OLTP) contains up-to-date, modifiable, application-specific data. A data warehouse (OLAP) is a subject-oriented, integrated, time-variant and non-volatile collection of data used to make business decisions. The Hadoop Distributed File System (HDFS) allows storing large amounts of data on a cloud of machines. In this paper, we surveyed the literature related to operational databases, data warehouses and Hadoop technology.
Infrastructure Considerations for Analytical Workloads (Cognizant)
Using Apache Hadoop clusters and Mahout for analyzing big data workloads yields extraordinary performance; we offer a detailed comparison of running Hadoop in a physical vs. virtual infrastructure environment.
The document discusses data warehousing, data mining, and business intelligence applications. It explains that data warehousing organizes and structures data for analysis, and that data mining involves preprocessing, characterization, comparison, classification, and forecasting of data to discover knowledge. The final stage is presenting discovered knowledge to end users through visualization and business intelligence applications.
This document defines a data warehouse as a collection of corporate information derived from operational systems and external sources to support business decisions rather than operations. It discusses the purpose of data warehousing to realize the value of data and make better decisions. Key components like staging areas, data marts, and operational data stores are described. The document also outlines evolution of data warehouse architectures and best practices.
This document provides an overview of key concepts related to decision support systems (DSS) and data warehousing. It defines DSS as interactive computer systems that help decision makers use data, documents, models and communication technologies to identify and solve problems. It then discusses operational databases and how they differ from data warehouses in areas like data type, focus, users and more. Finally, it defines key characteristics of a data warehouse as being subject-oriented, integrated, time-variant and non-volatile to support management decision making.
The Big Data Importance – Tools and their Usage (IRJET Journal)
This document discusses big data, tools for analyzing big data, and opportunities that big data analytics provides. It begins by defining big data and its key characteristics of volume, variety and velocity. It then discusses tools for storing, managing and processing big data like Hadoop, MapReduce and HDFS. Finally, it outlines how big data analytics can be applied across different domains to enable new insights and informed decision making through analyzing large datasets.
About Streaming Data Solutions for Hadoop (Lynn Langit)
This document discusses selecting the best approach for fast big data and streaming analytics projects. It describes key considerations for the architectural design phases such as scalable ingestion, real-time ETL, analytics, alerts and actions, and visualization. Component selection factors include the overall architecture, enterprise-grade streaming engine, ease of use and development, and management/DevOps. The document provides definitions of relevant technologies and compares representative solutions to help identify the best fit based on an organization's needs and skills.
Lecture4 big data technology foundations (hktripathy)
The document discusses big data architecture and its components. It explains that big data architecture is needed when analyzing large datasets over 100GB in size or when processing massive amounts of structured and unstructured data from multiple sources. The architecture consists of several layers including data sources, ingestion, storage, physical infrastructure, platform management, processing, query, security, monitoring, analytics and visualization. It provides details on each layer and their functions in ingesting, storing, processing and analyzing large volumes of diverse data.
This document discusses web data extraction and analysis using Hadoop. It begins by explaining that web data extraction involves collecting data from websites using tools like web scrapers or crawlers. Next, it describes that the data extracted is often large in volume and requires processing tools like Hadoop for analysis. The document then provides details about using MapReduce on Hadoop to analyze web data in a parallel and distributed manner by breaking the analysis into mapping and reducing phases.
Big Data Processing with Hadoop: A Review (IRJET Journal)
1. This document provides an overview of big data processing with Hadoop. It defines big data and describes the challenges of volume, velocity, variety and variability.
2. Traditional data processing approaches are inadequate for big data due to its scale. Hadoop provides a distributed file system called HDFS and a MapReduce framework to address this.
3. HDFS uses a master-slave architecture with a NameNode and DataNodes to store and retrieve file blocks. MapReduce allows distributed processing of large datasets across clusters through mapping and reducing functions.
The document discusses Big Data architectures and Oracle's solutions for Big Data. It provides an overview of key components of Big Data architectures, including data ingestion, distributed file systems, data management capabilities, and Oracle's unified reference architecture. It describes techniques for operational intelligence, exploration and discovery, and performance management in Big Data solutions.
BD_Architecture and Charateristics.pptx.pdf (eramfatima43)
A big data architecture handles large and complex data through batch processing, real-time processing, interactive exploration, and predictive analytics. It includes data sources, storage, batch and stream processing, an analytical data store, and analysis/reporting tools. Orchestration tools automate workflows that transform data between components. Consider this architecture for large volumes of data, real-time data streams, and machine learning/AI applications. It provides scalability, performance, and integration with existing solutions, though complexity, security, and specialized skills are challenges.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable and distributed processing of large datasets. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. HDFS stores data reliably across machines in a Hadoop cluster and MapReduce processes data in parallel by breaking the job into smaller fragments of work executed across cluster nodes.
Big data is characterized by 3 V's - volume, velocity, and variety. It refers to large and complex datasets that are difficult to process using traditional database management tools. Key technologies to handle big data include distributed file systems, Apache Hadoop, data-intensive computing, and tools like MapReduce. Common tools used are infrastructure management tools like Chef and Puppet, monitoring tools like Nagios and Ganglia, and analytics platforms like Netezza and Greenplum.
The document discusses big data analysis and provides an introduction to key concepts. It is divided into three parts: Part 1 introduces big data and Hadoop, the open-source software framework for storing and processing large datasets. Part 2 provides a very quick introduction to understanding data and analyzing data, intended for those new to the topic. Part 3 discusses concepts and references to use cases for big data analysis in the airline industry, intended for more advanced readers. The document aims to familiarize business and management users with big data analysis terms and thinking processes for formulating analytical questions to address business problems.
This document provides an overview of big data. It begins with an introduction that defines big data as massive, complex data sets from various sources that are growing rapidly in volume and variety. It then discusses the brief history of big data and provides definitions, describing big data as data that is too large and complex for traditional data management tools. The document outlines key aspects of big data including the sources, types, applications, and characteristics. It discusses how big data is used in business intelligence to help companies make better decisions. Finally, it describes the key aspects a big data platform must address such as handling different data types, large volumes, and analytics.
Enterprise Data Lake: How to Conquer the Data Deluge and Derive Insights that Matters
Data can be traced from various consumer sources, and managing data is one of the most serious challenges faced by organizations today. Organizations are adopting the data lake model because lakes provide raw data that users can use for data experimentation and advanced analytics. A data lake can be a merging point of new and historic data, drawing correlations across all data using advanced analytics. A data lake can also support self-service data practices, tapping undiscovered business value from new as well as existing data sources. Furthermore, a data lake can aid in modernizing data warehousing, analytics, and data integration. However, lakes also face hindrances like immature governance, user skills and security.
Gdpr ccpa automated compliance - spark java application features and functi... (Steven Meister)
This PowerPoint covers the critical aspects and needs present in any project designed to meet regulatory requirements such as GDPR and CCPA.
Big data analytics (BDA) involves examining large, diverse datasets to uncover hidden patterns, correlations, trends, and insights. BDA helps organizations gain a competitive advantage by extracting insights from data to make faster, more informed decisions. It supports a 360-degree view of customers by analyzing both structured and unstructured data sources like clickstream data. Businesses can leverage techniques like machine learning, predictive analytics, and natural language processing on existing and new data sources. BDA requires close collaboration between IT, business users, and data scientists to process and analyze large datasets beyond typical storage and processing capabilities.
DOCUMENT SELECTION USING MAPREDUCE, Yenumula B Reddy and Desmond Hill (ClaraZara1)
Big data refers to structured, unstructured and semi-structured large volumes of data which are difficult to manage and costly to store. Using exploratory analysis techniques to understand such raw data, while carefully balancing the benefits in terms of storage and retrieval techniques, is an essential part of big data. The research discusses MapReduce issues, a framework for the MapReduce programming model, and its implementation. The paper includes the analysis of big data using MapReduce techniques and the identification of a required document from a stream of documents. Identifying a required document is part of security in a stream of documents in the cyber world; the document may be significant in business, medical, social, or terrorism contexts.
This document provides an overview of big data fundamentals and considerations for setting up a big data practice. It discusses key big data concepts like the four V's of big data. It also outlines common big data questions around business context, architecture, skills, and presents sample reference architectures. The document recommends starting a big data practice by identifying use cases, gaining management commitment, and setting up a center of excellence. It provides an example use case of retail web log analysis and presents big data architecture patterns.
Become Data Driven With Hadoop as-a-Service (Mammoth Data)
This presentation gives an overview of what it means to be a data-driven company, the pros and cons of becoming data driven, and a few software tools used in data management.
Stream Meets Batch for Smarter Analytics - Impetus White Paper (Impetus Technologies)
For Impetus' White Papers archive, visit http://www.impetus.com/whitepaper
The paper discusses how the traditional batch and real-time paradigms can work together to deliver smarter, quicker and better insights on large volumes of data by picking the right strategy and the right technology.
This document provides an overview of Oracle's Information Management Reference Architecture. It includes a conceptual view of the main architectural components, several design patterns for implementing different types of information management solutions, a logical view of the components in an information management system, and descriptions of how data flows through ingestion, interpretation, and different data layers.
Table of Contents

Executive Summary
Big Data Classification
Hadoop-based Architecture Approaches
    Data Lake
    Lambda
    Choosing the Correct Architecture
Data Lake Architecture
    Generic Data Lake Architecture
    Steps Involved
Lambda Architecture
    Batch Layer
    Serving Layer
    Speed Layer
    Generic Lambda Architecture
References
EXECUTIVE SUMMARY

Apache Hadoop didn't disrupt the datacenter; the data did. Shortly after corporate IT functions within enterprises adopted large-scale systems to manage data, the Enterprise Data Warehouse (EDW) emerged as the logical home of all enterprise data. Today, every enterprise has a Data Warehouse that serves to model and capture the essence of the business from its enterprise systems.

The explosion of new types of data in recent years, from inputs such as the web and connected devices, or just sheer volumes of records, has put tremendous pressure on the EDW. In response to this disruption, an increasing number of organizations have turned to Apache Hadoop to help manage the enormous increase in data whilst maintaining coherence of the Data Warehouse.

This POV discusses Apache Hadoop and its capabilities as a data platform and processing system, and how the core of Hadoop and its surrounding ecosystem meets the enterprise requirements to integrate alongside the Data Warehouse and other enterprise data systems as part of a modern data architecture: a step on the journey toward delivering an enterprise 'Data Lake' or a Lambda Architecture (immutable data + views).

An enterprise data lake provides the following core benefits to an enterprise:

• New efficiencies for data architecture through a significantly lower cost of storage, and through optimization of data processing workloads such as data transformation and integration.

• New opportunities for business through flexible 'schema-on-read' access to all enterprise data, and through multi-use and multi-workload data processing on the same sets of data, from batch to real time.

Apache Hadoop provides both reliable storage (HDFS) and a processing system (MapReduce) for large data sets across clusters of computers. MapReduce is a batch query processor targeted at long-running background processes. Hadoop can handle Volume. But to handle Velocity, we need real-time processing tools that can compensate for the high latency of batch systems and serve the most recent data continuously, as new data arrives and older data is progressively integrated into the batch framework. The answer to this problem is the Lambda Architecture.
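To make the batch-processing model concrete, here is a minimal sketch of the classic word-count job written for Hadoop Streaming, which lets any executable process HDFS-resident text; the script names are illustrative assumptions, not part of the original paper.

```python
#!/usr/bin/env python3
# mapper.py: Hadoop Streaming mapper; emits "word<TAB>1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: Hadoop Streaming reducer; input arrives sorted by key, so counts
# for the same word are adjacent and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this is typically submitted with the hadoop-streaming jar that ships with Hadoop (the exact path varies by distribution), pointing -input and -output at HDFS directories. The long job-startup and shuffle phases of such a job are precisely the batch latency that the Lambda Architecture's speed layer is meant to compensate for.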
Big Data Classification

Processing Type: Batch; Near Real Time; Real Time + Batch
Processing Methodology: Prescriptive; Predictive; Diagnostic; Descriptive
Data Frequency: On demand; Continuous; Real Time; Batch
Data Type: Transactional; Historical; Master data; Metadata
Content Format: Structured; Unstructured (images, text, videos, documents, emails, etc.); Semi-structured (XML, JSON)
Data Sources: Machine generated; Web & social media; IoT; Human generated; Transactional data; Via other data providers
It's helpful to look at the characteristics of big data along certain lines: for example, how the data is collected, analyzed, and processed. Once the data and its processing are classified, they can be matched with the appropriate big data analysis architecture:

• Processing type - Whether the data is analyzed in real time or batched for later analysis. Give careful consideration to choosing the analysis type, since it affects several other decisions about products, tools, hardware, data sources, and expected data frequency. A mix of both types ("near real time" or micro-batch) may also be required by the use case.

• Processing methodology - The type of technique to be applied for processing data (e.g., predictive, analytical, ad-hoc query, and reporting). Business requirements determine the appropriate processing methodology, and a combination of techniques can be used. The choice of processing methodology helps identify the appropriate tools and techniques to be used in your big data solution.

• Data frequency and size - How much data is expected, and at what frequency it arrives. Knowing frequency and size helps determine the storage mechanism, storage format, and the necessary preprocessing tools. Data frequency and size depend on data sources: on demand, as with social media data; continuous feed, real time (weather data, transactional data); or time series (time-based data).

• Data type - The type of data to be processed: transactional, historical, master data, and others. Knowing the data type helps segregate the data in storage.

• Content format - The format of incoming data: structured (RDBMS, for example), unstructured (audio, video, and images, for example), or semi-structured. Format determines how the incoming data needs to be processed and is key to choosing tools and techniques and to defining a solution from a business perspective.

• Data source - The sources of the data (where the data is generated): web and social media, machine-generated, human-generated, etc. Identifying all the data sources helps determine the scope from a business perspective.
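As an illustrative sketch only (the categories and rules below are assumptions layered on the paper's classification, not part of it), a classified workload could be matched to an architecture roughly like this:

```python
# Hypothetical rule-of-thumb matcher from classification attributes to an
# architecture; the field values and decision rules are illustrative.
from dataclasses import dataclass

@dataclass
class Workload:
    processing_type: str    # "batch", "near_real_time", or "real_time_plus_batch"
    needs_historical: bool  # queries must span both old and fresh data

def choose_architecture(w: Workload) -> str:
    # Pure batch analysis over raw data fits a plain data lake.
    if w.processing_type == "batch":
        return "data lake"
    # Low-latency views over both fresh and historical data fit Lambda.
    if w.processing_type == "real_time_plus_batch" or w.needs_historical:
        return "lambda"
    # Real-time-only needs can be served by a streaming engine over the lake.
    return "data lake + streaming engine (e.g., Storm)"

print(choose_architecture(Workload("real_time_plus_batch", True)))  # -> lambda
```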
Hadoop-based architecture approaches

Data Lake

A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop.

Lambda

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of the lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of MapReduce.
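A minimal sketch of the three Lambda layers, assuming a simple page-view counting use case (the data model and function names are illustrative, not from the paper; real systems expire speed-layer state more carefully than this):

```python
# Immutable master dataset, a batch view recomputed from scratch, a speed view
# updated incrementally, and a serving layer that merges both at query time.
from collections import defaultdict

master_dataset = []             # immutable, append-only raw events
batch_view = defaultdict(int)   # precomputed on a schedule by the batch layer
speed_view = defaultdict(int)   # covers events since the last batch run

def run_batch_layer():
    """Recompute the batch view from the entire master dataset."""
    batch_view.clear()
    for event in master_dataset:
        batch_view[event["page"]] += 1
    speed_view.clear()          # recent data is now covered by the batch view

def ingest(event):
    """New data lands in both the master dataset and the speed layer."""
    master_dataset.append(event)
    speed_view[event["page"]] += 1

def query(page):
    """Serving layer: merge the batch and speed views."""
    return batch_view[page] + speed_view[page]

ingest({"page": "/home"})
run_batch_layer()
ingest({"page": "/home"})
print(query("/home"))  # 2: one from the batch view, one from the speed view
```

The key property is that the speed layer only has to cover the window since the last batch run, so its state stays small, while the batch layer periodically recomputes exact views from the immutable master dataset.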
Choosing the correct architecture
7. 6
Parameter
Data
Lake
Lambda
Simultaneous
access
to
Real
time
and
Batch
data
Data
Lake
can
use
real
time
processing
technologies
like
Storm
to
return
real
time
results,
however
in
such
a
scenario
historical
results
cannot
be
made
available.
If
we
use
technologies
like
Spark
to
process
data,
real
time
data
and
historical
data,
on
request
there
can
be
significant
delays
in
response
time
to
clients
as
compared
to
Lambda
architecture.
Lambda
Architecture’s
Serving
Layer
merges
the
output
of
Batch
Layer
and
Speed
Layer,
before
sending
the
results
of
user
queries.
As
data
is
already
processed
into
views
at
both
the
layers,
the
response
time
is
significantly
less.
Latency
Latency
is
high
as
compared
to
Lambda,
as
real
time
data
need
to
be
processed
with
historical
data
on-‐demand
or
as
a
part
of
batch.
Low-‐latency
real
time
results
are
processed
by
Speed
layer
and
Batch
results
are
pre-‐
processed
in
Batch
layer.
On
request,
both
the
results
are
just
merged,
there
by
resulting
low
latency
time
for
real
time
processing.
Ease
of
Data
Governance
Data
lake
is
coined
to
convey
the
concept
of
centralized
repository
containing
virtually
inexhaustible
amounts
of
raw
data
(or
minimally
curated)
data
that
is
readily
made
available
anytime
to
anyone
authorized
to
perform
analytical
activities.
Lambda
architecture’s
serving
layer
gives
access
to
processed
and
analyzed
data.
As
uses
get
access
to
processed
data
directly,
it
can
lead
to
top
down
data
governance
issues.
Updates
in
source
data
As
data
lake
stores
only
raw
data,
updates
are
just
appended
to
raw
data,
thereby
makes
life
of
business
users
difficult
to
write
business
logic,
in
such
a
way
that
latest
updated
records
are
considered
in
calculations.
Batch
Views
are
always
computed
from
starch
in
Lambda
Architecture.
As
a
result,
updates
can
be
easily
incorporated
in
calculated
Views
in
each
reprocess
batch
cycle.
Fault
tolerance
against
human
errors
Data
Scientist
or
business
users,
running
business
logic
on
relevant
raw
data
in
Data
Lake
might
lead
to
human
errors.
Although,
re-‐covering
from
those
errors
is
not
difficult
as
it’s
just
a
matter
of
re-‐running
the
logic.
However,
the
reprocessing
time
for
large
datasets
might
lead
to
some
delays.
Lambda
architecture
assures
fault
tolerance
not
only
against
hardware
failures
but
against
human
errors.
Re-‐computation
of
views
every
time
from
raw
data
in
batch
layer,
insures
that
any
human
errors
in
business
logic
would
not
be
cascaded
to
a
level
where
it’s
unrecoverable.
Ease
of
business
users
Data
is
stored
in
raw
format,
Data
is
processed
and
available
8. 7
with
data
definitions
and
sometime
groomed
to
make
digestible
by
data
management
tools.
At
times,
it
difficult
for
business
users
to
use
data
in
as-‐
is
conditions.
from
Serving
makes
life
easy
for
business
users.
Accuracy
for
real
time
results
Irrespective
of
any
scenario,
users
accessing
data
from
Data
Lake
has
access
to
immutable
raw
data,
they
can
do
exact
computations,
thereby
always
get
the
accurate
results.
In
scenarios,
where
real
time
calculations
need
to
access
historical
data,
which
is
not
possible,
Lambda
architecture
would
return
you
estimated
results.
For
example,
calculation
of
mean
value,
cannot
be
achieved
until
whole
historical
data
and
real
time
data
is
referenced
at
one
go.
In
such
a
scenario,
serving
layer
would
return
estimated
results.
Infrastructure
Cost
Data
lake
architecture
process
the
data
as
and
when
need
and
thereby
the
cluster
cost
can
be
much
less
as
compared
to
Lambda.
Moreover,
it
only
persist
the
raw
data
however
Lambda
architecture
not
only
persist
the
raw
data
but
processed
data
too.
This
leads
to
extra
storage
cost
in
Lambda
architecture.
Lambda
architecture
data
processing
life
cycle
is
designed
in
such
a
fashion
that
as
soon
the
one
cycle
of
batch
process
is
finished,
it
starts
a
new
cycle
of
batch
processing
which
includes
the
recently
inserted
data.
Simultaneously,
the
speed
layer
is
always
processing
the
real
time
data.
OLAP
Data Lake: Unlike data marts, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, a data lake is designed to retain all attributes, especially when you do not yet know what the scope of the data or its use will be.
Lambda: As Lambda exposes the processed views from the serving layer, all the attributes of the data may at times not be available to a data scientist for running analytical queries.
Historical data reference for processing
Data Lake: OLAP and OLTP queries access the raw or groomed data directly from the data lake, making it feasible to access and refer to historical data while processing data for a given time interval.
Lambda: The Speed layer has no reference to the historical data stored in the batch layer, making it difficult to run queries that refer to historical data. For example, 'unique count' type queries cannot return correct results from the Speed layer alone. However, 'calculating average' type queries can be answered easily by the Serving layer, by generating the average of the results returned from the Speed and Batch layers on the fly.
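A small sketch of why averages merge cleanly across layers while unique counts do not; representing each layer's partial result as a (sum, count) pair is our assumption, not something the document prescribes:

```python
# Averages merge across layers because sum and count are both additive.
def merged_average(batch_sum, batch_count, speed_sum, speed_count):
    return (batch_sum + speed_sum) / (batch_count + speed_count)

print(merged_average(9_000, 300, 110, 10))  # exact mean over all the data

# Unique counts do NOT merge this way: the speed layer cannot know which
# of its values were already seen historically, so
#   unique(batch) + unique(speed)
# overcounts every value that appears in both layers.
```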
Slowly Changing Dimensions
Data Lake: Although the data lake has records of the changed dimension attributes, extra business logic needs to be written by business users to cater for them.
Lambda: Lambda architecture can easily cater for slowly changing dimensions by creating surrogate keys parallel to the natural keys whenever a change is detected in dimension attributes during the batch layer processing cycle.
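A hedged sketch of that surrogate-key idea (a Type 2 slowly changing dimension); the table and field names are illustrative, not from the document:

```python
# Each detected change to a dimension row gets a new surrogate key, while
# the natural key stays constant, preserving the change history.
dimension = []          # all dimension versions
next_surrogate = [1]

def upsert_customer(natural_key, city):
    current = next((r for r in dimension
                    if r["natural_key"] == natural_key and r["is_current"]), None)
    if current and current["city"] == city:
        return                          # no change detected
    if current:
        current["is_current"] = False   # close out the previous version
    dimension.append({"surrogate_key": next_surrogate[0],
                      "natural_key": natural_key,
                      "city": city,
                      "is_current": True})
    next_surrogate[0] += 1

upsert_customer("C42", "Pune")
upsert_customer("C42", "Mumbai")   # change detected -> new surrogate key
print(dimension)                   # both versions retained
```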
Slowly Changing Facts
Data Lake: In the data lake, both versions of a fact are available for users to look at; this can lead to good analytical results if the fact life cycle is an attribute in the business logic for data analytics.
Lambda: Although it is easy to change the facts in Lambda architecture, this leads to a loss of information about the fact life cycle. As the previous state of a fact is not available to the data scientist, analytical queries might not give the desired results on the views exposed by the Serving layer.
Frequently changing business logic
Data Lake: Changes in the processing code need to be made, but there is no clear solution for how the historically processed data should be handled.
Lambda: As data is re-processed from scratch, even if the business logic changes frequently, the historical-data problem is resolved automatically.
Implementation lifecycle
Data Lake: A data lake is fast to implement, as it eliminates the dependency on upfront data modeling.
Lambda: Processing logic needs to be implemented at both the batch and speed layers, leading to significant implementation time as compared to a data lake.
Adding new data sources
Data Lake: Very easy to add.
Lambda: New sources need to be incorporated into the processing layers and would require code changes.
"If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
– James Dixon (Pentaho CTO)
Data Lake Architecture
Much of today's research and decision making is based on knowledge and insight that can be gained from analyzing and contextualizing the vast (and growing) amount of "open" or "raw" data. The concept that the large number of data sources available today facilitates analyses on combinations of heterogeneous information that would not be achievable via "siloed" data maintained in warehouses is very powerful.
The term data lake has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities.
A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop.
Unlike traditional warehouses, the format of the data is not described (that is, its schema is not available) until the data is needed. By delaying the categorization of data from the point of entry to the point of use, analytical operations that transcend the rigid format of an adopted schema become possible. Query and search operations on the data can be performed using traditional database technologies (when structured), as well as via alternate means such as indexing and NoSQL derivatives.
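A minimal sketch of this "schema on read" idea, assuming raw events stored as JSON lines; the file name and field names are hypothetical:

```python
# Raw events are stored as-is; a schema (the list of wanted fields) is
# applied only at the point of use, not at the point of entry.
import json

def read_events(path, fields):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # Missing fields surface as None instead of failing at load time.
            yield {field: record.get(field) for field in fields}

# Two consumers can impose two different schemas on the same raw file:
# list(read_events("events.jsonl", ["user_id", "url"]))
# list(read_events("events.jsonl", ["user_id", "latency_ms"]))
```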
Key
Features
• Stores
Raw
data
–
Single
source
of
truth
• Data
accessible
to
anyone
authorized
• Polyglot
Persistence
• Support
multiple
applications
&
Workloads
• Low
Cost,
High
Performance
storage
• Flexible,
easy
to
use
data
organization
• Self-‐service
end-‐user
• More
Flexible
to
answer
new
questions
• Easy
to
add
new
data
sources
• Loosely
coupled
architecture
–
enables
flexibility
of
analysis
• Eliminating
dependency
of
data
modeling
upfront
–
thereby
fast
to
implement
• Storage
is
highly
optimized
as
raw
data
is
stored
Disadvantages
• High latency for a composite analysis view of both real-time and historical data
• Raw data provides no relational structure, which is unfriendly for on-the-fly business analytics
In a practical sense, a data lake is characterized by three key attributes:
• Collect everything: A data lake contains all data – both raw sources over extended periods of time, and any processed data.
• Dive in anywhere: A data lake enables users across multiple business units to refine, explore and enrich data on their terms.
• Flexible access: A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.
Generic Data Lake Architecture
[Diagram: data sources (real-time, micro-batch, and mega-batch feeds from desktop & mobile, social media and cloud, operational systems, and the Internet of Things) enter through an ingestion tier into HDFS storage holding unstructured and structured data. A unified data management tier (data management, data access) and a processing tier (workflow management; in-memory, MapReduce/Hive/MPP) use schematic metadata and grooming to turn raw data into processed data, while a query interface (SQL, NoSQL, external storage) and a centralized management system (system monitoring, system management) deliver real-time, interactive, and batch insights with flexible actions.]
Steps Involved
• Procuring data – the process of obtaining data and metadata and preparing them for eventual inclusion in a data lake.
• Obtaining data – transferring the data physically from the source to the data lake.
• Describing data – a data scientist searching a data lake for useful data must be able to find the data relevant to his or her need; for this, they require metadata about the data. Schematic metadata for a data set would include information about how the data is formatted and about its schema.
• Grooming data – the process by which raw data is made consumable by analytics applications. In some scenarios the grooming process uses schematic metadata to transform raw data into data that can be processed by standard data management tools.
• Provisioning data – the authentication and authorization policies by which consumers take data out of the data lake.
• Preserving data – managing a data lake also requires attention to maintenance issues such as staleness, expiration, decommissions and renewals.
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation.
Lambda Architecture
The Lambda architecture is split into three layers: the batch layer, the serving layer, and the speed layer.
1. Batch layer (Apache Hadoop)
2. Serving layer (Cloudera Impala, Spark)
3. Speed layer (Storm, Spark, Apache HBase, Cassandra)
Key Features
• Low-latency simultaneous analysis of the (near) real-time information extracted from a continuous inflow of data, alongside persistent analysis of a massive volume of data
• Fault tolerant not only against hardware failure but against human error too
• Mistakes are corrected by re-computations
• Storage is highly optimized, as raw data is stored
Batch Layer
The batch layer is responsible for two things. The first is to store the immutable, constantly growing master dataset (HDFS), and the second is to compute arbitrary views from this dataset (MapReduce). Computing the views is a continuous operation, so when new data arrives it will be aggregated into the views when they are recomputed during the next MapReduce iteration. The views should be computed from the entire dataset, and therefore the batch layer is not expected to update the views frequently. Depending on the size of your dataset and cluster, each iteration could take hours.
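As a hedged illustration of "computing views from the entire dataset", the in-process Python below mimics the map/shuffle/reduce shape; a real batch layer would run Hadoop MapReduce over HDFS, and the record fields here are invented:

```python
from collections import defaultdict

def map_phase(records):
    for record in records:            # e.g. one pageview event per record
        yield record["url"], 1

def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, value in pairs:          # shuffle + reduce: sum per key
        totals[key] += value
    return dict(totals)

master_dataset = [{"url": "/home"}, {"url": "/docs"}, {"url": "/home"}]
batch_view = reduce_phase(map_phase(master_dataset))
print(batch_view)                     # {'/home': 2, '/docs': 1}
```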
Serving layer
The output from the batch layer is a set of flat files containing the precomputed views. The serving layer is responsible for indexing and exposing the views so that they can be queried. The batch and serving layers alone do not satisfy any realtime requirement, however, because MapReduce is (by design) high-latency and it could take a few hours for new data to be represented in the views and propagated to the serving layer. This is why we need the speed layer.
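One hedged reading of "indexing and exposing the views": load the flat batch-view files into an in-memory map so point queries avoid file scans. The file name and key,value format below are hypothetical:

```python
import csv

def load_batch_view(path):
    """Index a flat batch-view file (one key,value pair per line)."""
    index = {}
    with open(path) as f:
        for key, value in csv.reader(f):
            index[key] = int(value)
    return index

# view = load_batch_view("pageviews_batch_view.csv")  # hypothetical file
# view.get("/home", 0)                                # served to queries
```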
Speed layer
In essence the speed layer is the same as the batch layer in that it computes views from the data it receives. The speed layer is needed to compensate for the high latency of the batch layer, and it does this by computing realtime views in Storm. The realtime views contain only the delta results to supplement the batch views. Whilst the batch layer is designed to continuously recompute the batch views from scratch, the speed layer uses an incremental model whereby the realtime views are incremented as and when new data is received.
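A minimal sketch of that incremental model (our own illustration, not code from the document): rather than recomputing from scratch, each arriving event mutates the realtime view in place.

```python
from collections import defaultdict

realtime_view = defaultdict(int)

def on_event(event):
    realtime_view[event["url"]] += 1   # O(1) update per event

for e in [{"url": "/home"}, {"url": "/home"}]:
    on_event(e)
print(dict(realtime_view))             # {'/home': 2}

# Once the next batch cycle absorbs these events into the batch view,
# the corresponding entries here can simply be dropped.
```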
What’s clever about the speed layer is that the realtime views are intended to be transient: as soon as the data propagates through the batch and serving layers, the corresponding results in the realtime views can be discarded. This is referred to as "complexity isolation", meaning that the most complex part of the architecture is pushed into the layer whose results are only temporary.
[Diagram: realtime views are discarded once the data they contain is represented in a batch view; along a timeline ending at "now", successive batch views cover the older data while realtime views cover only the most recent window.]
Disadvantages
• Maintaining two copies of code that must produce the same result in two complex distributed systems
• Could return estimated or approximate results
• Expensive full recomputation is required for fault tolerance
• Requires high cluster up-time, as batch data needs to be processed continuously
• Requires more implementation time, as duplicate code needs to be written in separate technologies to process real-time and batch data
• Time taken to process a batch grows linearly with the volume of data
Generic Lambda Architecture
[Diagram: an incoming data stream feeds both the batch layer and the speed layer. The batch layer stores all data (HDFS) and precomputes batch views (MR/Hive/Pig); the serving layer holds the pre-computed views and summarized data for query; the speed layer (Storm or Spark) processes the stream and increments near-real-time views / stream summarization. Queries merge the batch views with the realtime views, under an overall data management & access layer.]