Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It was designed to scale up from single servers to thousands of machines, with very high fault tolerance. The document provides an overview of Hadoop, including its evolution from centralized mainframe systems, the motivation for its development, its core architecture components like HDFS and YARN, common technologies that operate within the Hadoop ecosystem like Hive and Spark, and how businesses can use it for tasks like data analytics and business intelligence.
A talk given to the students of the 'Big Data Optimization Certificate' program at the University of Delaware:
Contents of the Talk:
EVOLUTION AND EXPANSION OF BUSINESS DATA PROCESSING
MOTIVATION BEHIND HADOOP
HADOOP ARCHITECTURE
HADOOP TECHNOLOGIES AND USAGES
DATA WRANGLING ON HADOOP
BUSINESS INTELLIGENCE AND ANALYTICS ON HADOOP
This document provides an overview of big data and Hadoop. It discusses what big data is, its types including structured, semi-structured and unstructured data. Some key sources of big data are also outlined. Hadoop is presented as a solution for managing big data through its core components like HDFS for storage and MapReduce for processing. The Hadoop ecosystem including other related tools like Hive, Pig, Spark and YARN is also summarized. Career opportunities in working with big data are listed in the end.
1. The document describes building an analytical platform for a retailer by using open source tools R and RStudio along with SAP Sybase IQ database.
2. Key aspects included setting up SAP Sybase IQ as a column-store database for storage and querying of data, implementing R and RStudio for statistical analysis, and automating running of statistical models on new data.
3. The solution provided a low-cost platform capable of rapid prototyping of analytical models and production use for predictive analytics.
Architecting Big Data Ingest & Manipulation – George Long
Here's the presentation I gave at the KW Big Data Peer2Peer meetup held at Communitech on 3rd November 2015.
The deck served as a backdrop to the interactive session
http://www.meetup.com/KW-Big-Data-Peer2Peer/events/226065176/
The scope was to drive an architectural conversation about :
o What it actually takes to get the data you need to add that one metric to your report/dashboard?
o What's it like to navigate the early conversations of an analytic solution?
o How is one technology selected over another and how do those selections impact or define other selections?
Part 3 - Data Warehousing Lecture at BW Cooperative State University (DHBW) – Andreas Buckenhofer
Part 3 of 4
The slides contain a DWH lecture given for students in 5th semester. Content:
- Introduction DWH and Business Intelligence
- DWH architecture
- DWH project phases
- Logical DWH Data Model
- Multidimensional data modeling
- Data import strategies / data integration / ETL
- Frontend: Reporting and analysis, information design
- OLAP
The presentation compares Data Lakes with classical DWHs. Topics like schema-on-read, schema-on-write, security, JSON, data modeling, data integration are covered.
Part 2 - Data Warehousing Lecture at BW Cooperative State University (DHBW) – Andreas Buckenhofer
The document provides information about Andreas Buckenhofer and Daimler TSS. It discusses Daimler TSS's locations, what attendees will learn about data warehouse data modeling and OLAP, and an overview of data modeling for OLTP applications, Codd's normal forms, and dimensional modeling for data marts.
Hadoop is a scalable data storage system that stores large amounts of data across computer clusters, allowing for additional computing resources to be easily added as data needs increase. In contrast, a relational database management system (RDBMS) stores structured data in tables that contain rows and columns, with relationships defined by primary and foreign keys. RDBMSs are better suited to applications with predictable data workloads, while Hadoop's scalability makes it more flexible for companies facing fluctuating or growing data needs.
The document provides information about an international business course for a BBA program at Bangladesh University of Business and Technology. It includes the course title, code, and submitting students' names and details. It then discusses the history of databases and database management systems from the 1960s to present day.
From big data to big value: Infrastructure need and Huawei best practice – BSP Media Group
This document discusses Huawei's big data infrastructure solutions and best practices. It summarizes that traditional infrastructure cannot scale to meet big data needs, which require scaling capacity, bandwidth, and throughput on demand. Huawei's strategy is to provide an intelligent, application-aware platform that natively supports multiple workloads through integrated storage, analysis, and archiving functions. The document highlights Huawei's OceanStor 9000 storage platform, which offers leading performance and scalability through a distributed architecture, and its enterprise-level Hadoop platform.
The other Apache Technologies your Big Data solution needs – gagravarr
The document discusses many Apache projects relevant to big data solutions, including projects for loading and querying data like Pig and Gora, building MapReduce jobs like Avro and Thrift, cloud computing with LibCloud and DeltaCloud, and extracting information from unstructured data with Tika, UIMA, OpenNLP, and cTakes. It also mentions utility projects like Chemistry, JMeter, Commons, and ManifoldCF.
This document provides an overview of a game plan for analyzing malware. It will include a theoretical overview today followed by detailed presentations on virtualization, honeypots/honeynets, debugging, and more. It discusses setting up a controlled lab environment for analysis including static analysis, network traffic analysis, disk/file system analysis, and memory analysis. It also discusses various tools that can be used for each part of the analysis process.
This document discusses how Facebook uses big data and various technologies like Hadoop, Hive, Memcached, Varnish Cache, Scribe, and Haystack to scale their platforms and processes massive amounts of user data. It provides details on Facebook's architecture and how they have overcome scaling challenges. It also discusses technologies like LAMP stack, HipHop, and Open Compute Project that Facebook has utilized.
This document discusses big data processing on the cloud. It describes the 3V model of big data, which refers to volume, velocity and variety. It discusses operational challenges of big data and how distributed processing tools like Hadoop and Spark can help address these challenges. It provides an example use case of using Apache Spark on Amazon Web Services along with Apache Zeppelin and Amazon S3 for distributed data analytics.
Hopsworks in the cloud – Berlin Buzzwords 2019 – Jim Dowling
This talk, given at Berlin Buzzwords 2019, describes the recent progress in making Hopsworks a cloud-native platform, with HA data-center support added for HopsFS.
R is an open source programming language and software environment for statistical analysis and graphics. It is widely used among data scientists for tasks like data manipulation, calculation, and graphical data analysis. Some key advantages of R include that it is open source and free, has a large collection of statistical tools and packages, is flexible, and has strong capabilities for data visualization. It also has an active user community and can integrate with other software like SAS, Python, and Tableau. R is a popular and powerful tool for data scientists.
Building a scalable analytics environment to support diverse workloads – Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Building a scalable analytics environment to support diverse workloads
Tom Panozzo, Chief Technology Officer (Aunalytics)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
DM Radio Webinar: Adopting a Streaming-Enabled Architecture – DATAVERSITY
Architecture matters. That's why today's innovators are taking a hard look at streaming data, an increasingly attractive option that can transform business in several ways: replacing aging data ingestion techniques like ETL; solving long-standing data quality challenges; improving business processes ranging from sales and marketing to logistics and procurement; or any number of activities related to accelerating data warehousing, business intelligence and analytics.
Register for this DM Radio Deep Dive Webinar to learn how streaming data can rejuvenate or supplant traditional data management practices. Host Eric Kavanagh will explain how streaming-first architectures can relieve data engineers from time-consuming, error-prone processes, ideally bidding farewell to those unpleasant batch windows. He'll be joined by Kevin Petrie of Attunity, who will explain why (with real-world story successes) streaming data solutions can keep the business fueled with trusted data in a timely, efficient manner for improved business outcomes.
1. The customer asked the author to build an analytical platform to store data in a database and perform statistical analysis from a front-end interface.
2. The author chose an SAP Sybase IQ column-store database to store data, the open-source R programming language to perform statistical analysis, and RStudio as the front-end interface.
3. The solution provided a simple way to load and query large amounts of data, automated running of statistical models, and could be deployed in the cloud.
This document provides an overview of Red Hat Storage 2.1. It discusses how Red Hat Storage is an open, software-defined storage platform designed for modern hybrid datacenters. It also outlines key capabilities of Red Hat Storage 2.1 such as improved performance, geo-replication, SMB support, and unified deployment and management. The document highlights how Red Hat Storage provides scalable, available, and cost-effective storage while ensuring data protection and hybrid cloud capabilities.
The document discusses modernizing enterprise data warehouses by using a Hadoop data lake solution with EMC Isilon storage. This provides benefits like offloading expensive ETL processing to reduce costs, archiving cold data for cheaper storage, and enabling analytics on new data sources like semi-structured data. The solution leverages Hortonworks Data Platform for open, interoperable analytics and provides enterprise-grade data management capabilities on Hadoop at lower costs than traditional EDWs.
At the Public Sector Red Hat Storage Days on 1/20/16 and 1/21/16, Jason Calloway walked attendees through the basics of scalable POSIX file systems in the cloud.
The document discusses the NoSQL movement and non-relational databases. It provides background on the limitations of relational databases that led to the development of NoSQL databases. Examples of NoSQL databases are described like Voldemort, CouchDB, and Cassandra. Benefits of NoSQL databases include horizontal scaling, high availability, and faster performance.
A presentation on big data, given at the workshop "The Era of Big Data: Why and How?" at the 22nd Conference of the Computer Society of Iran (csicc2017.ir).
Vahid Amiri
vahidamiry.ir
datastack.ir
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016 – StampedeCon
The document discusses using a data lake approach with EMC Isilon storage to address various business use cases. It describes how the solution provides shared storage for multiple workloads through multi-protocol support, enables data protection and isolation of client data, and allows testing applications across Hadoop distributions through a common platform. Examples are given of how this approach supports an enterprise data hub, data warehouse offloading, data integration, and enrichment services.
This document provides an overview of cloud architecture best practices and Amazon Web Services. It discusses the advantages of using cloud stacks over individual physical servers, as cloud stacks provide scalability, redundancy, and a pay-as-you-use model. The document then outlines various AWS services for computing, storage, databases, security, and management and explains how AWS allows horizontal scaling of servers and storage as needed.
Hadoop Tutorial, Usage, Evolution, Data Lake, Business Intelligence by Sunitha Flowerhill
1. HADOOP OVERVIEW
By Sunitha Flowerhill
(Master's in Computer Applications - MCA)
Data, Business Intelligence and Hadoop Architect
2. AGENDA
EVOLUTION AND EXPANSION OF BUSINESS DATA PROCESSING
MOTIVATION BEHIND HADOOP
HADOOP ARCHITECTURE
HADOOP TECHNOLOGIES AND USAGES
DATA WRANGLING ON HADOOP
BUSINESS INTELLIGENCE AND ANALYTICS ON HADOOP
3. EVOLUTION – STAGE 1
70S – PUNCH CARDS AND PUNCH TAPES WITH HOLES IN THEM
COBOL AND JOB CONTROL LANGUAGE (JCL)
ISAM AND C-ISAM FILES – FLAT FILES WITH INDEXES
WINCHESTER HARD DISKS, WHICH LOOKED LIKE DRUMS
EXAMPLE SYSTEM – PDP-11 BY DIGITAL EQUIPMENT CORPORATION
DRAWBACK – VERY SLOW, LOW CAPACITY
4. EVOLUTION – STAGE 2
80S – MINICOMPUTERS ARRIVED
UNIX OPERATING SYSTEM (DEVELOPED AT BELL LABS IN THE LATE 60S, WITH THE BSD VARIANT FROM UC BERKELEY), STILL RUNNING IN MANY FORMS LIKE HP-UX AND AIX, AND WHOSE DESIGN LIVES ON IN LINUX – THE MAJOR OPERATING SYSTEM WHERE HADOOP RESIDES
RELATIONAL DATABASE SYSTEMS LIKE UNIFY, INFORMIX, SYBASE AND DB2
LAN-BASED NETWORKED PCS – NOVELL NETWARE, DBASE, FOXPRO – PC/MS-DOS/LAN-BASED RDBMS
SQL – STRUCTURED QUERY LANGUAGE, WHICH IS STILL HEAVILY USED ON HADOOP AS HIVEQL, SPARK SQL ETC.
STURDY AND FAULT TOLERANT
DRAWBACK: LIMITED PROCESSING POWER AND GREEN SCREENS! NOT MUCH OF A GRAPHICAL EXPERIENCE
5. EVOLUTION – STAGE 3
CLIENT-SERVER ARCHITECTURE – 2-TIER – PC-BASED THICK-CLIENT FRONT END FOR PROCESSING DATA AT THE USER END, AND A LAN- OR UNIX-BASED SERVER FOR THE DATABASE SERVER-SIDE PROCESSING
GRAPHICAL USER INTERFACE (GUI) FOR THE USER
MORE PROCESSING POWER AT THE SERVER SIDE
CONNECTION BETWEEN CLIENT AND SERVER USING OPEN DATABASE CONNECTIVITY (ODBC) OR CALL LEVEL INTERFACE (CLI) – USING DYNAMIC LINK LIBRARIES (DLLS)
CLASSIFIED AS DISTRIBUTED SYSTEMS
DATA STORAGE AND RECOVERY MECHANISMS SUCH AS MIRRORING, REPLICATION, BLADING ETC. WERE POSSIBLE AT THE SERVER LEVEL
DRAWBACK: LOW AVAILABILITY, FAILURES, LOTS OF TROUBLESHOOTING
“You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.” – Leslie Lamport, distributed systems computer scientist
6. EVOLUTION – STAGE 4
3-TIER ARCHITECTURE – THIN CLIENT, APPLICATION MIDDLEWARE, AND SERVERS FOR DATABASE STORAGE
THIN APPLICATION CLIENT OR WEB-BASED CLIENT, WHICH ONLY SERVES AS DATA DELIVERY, WITH MINIMAL PROCESSING AT THE CLIENT END
INTRODUCTION OF MIDDLEWARE SUCH AS TUXEDO, WEB SERVICES, JAVA BEANS – MOST OF THE BUSINESS LOGIC RESIDES HERE
USES PACKET TECHNOLOGY FOR EFFICIENT TRANSPORTATION AND RECOVERY
USES DIFFERENT INTERNET PROTOCOLS FOR SECURITY AND EFFICIENT TRANSPORTATION OF DATA BETWEEN THIN CLIENT AND SERVER
MORE GEOGRAPHICALLY DISTRIBUTED SERVERS, MIDDLEWARE SERVERS, CLUSTER COMPUTING, CHEAP HARDWARE
LOTS OF DATA CAPTURED ACROSS THE INTERNET, FROM SELF-SERVICE APPLICATIONS, USERS, MOBILE APPLICATIONS
7. THAT BRINGS US TO THE MOTIVATION BEHIND HADOOP
CHEAP CLUSTERED HARDWARE IS AVAILABLE NOW
WE COULD RUN A HADOOP CLUSTER WITH ALL THE LAPTOPS IN THIS CLASS CONNECTED TOGETHER AS NODES OF THE CLUSTER
HARDWARE FAILURE IS COMMON, SO DATA IS HEAVILY REPLICATED
MASSIVELY PARALLEL PROCESSING (MPP) – USAGE OF MULTIPLE CPUS FOR A SINGLE TASK – THE SPARK ENGINE IS A GOOD EXAMPLE OF MPP
VARIOUS ANALYSES CAN BE DONE ON LARGE DATASETS: FORECASTING, PREDICTIONS, DIRECTIONS FOR BUSINESS
ANALYTICS-BASED INTELLIGENCE RATHER THAN PURE PRODUCTION-BASED MIS REPORTS
SELLING OF THE DATASETS – HUGE BUSINESS
AND MANY MORE…
8. HADOOP
WE ARE DEALING WITH TERABYTES OF DATA HERE IN CLUSTERED COMPUTING
AN APACHE TOP-LEVEL PROJECT: AN OPEN-SOURCE IMPLEMENTATION FOR RELIABLE, SCALABLE, DISTRIBUTED COMPUTING AND STORAGE
DISTRIBUTED BY HORTONWORKS AND CLOUDERA
FLEXIBLE AND HIGHLY AVAILABLE ARCHITECTURE FOR LARGE-SCALE COMPUTATION AND DATA PROCESSING ON A NETWORK OF COMMODITY HARDWARE
STORAGE AND PROCESSING OF LARGE AND RAPIDLY GROWING DATA
STRUCTURED AND UNSTRUCTURED DATA
HIGH SCALABILITY AND AVAILABILITY
FAULT TOLERANCE
INFRASTRUCTURE MAINTENANCE IS NOW AVAILABLE AT LOW COST FROM CLOUD COMPANIES LIKE AWS, GOOGLE, GAIA, MS AZURE ETC.
9. BASIC ARCHITECTURE
MAIN NODES OF CLUSTER ARE WHERE MOST
OF THE COMPUTATIONAL POWER AND
STORAGE OF THE SYSTEM LIES
MAIN NODES RUN TASKTRACKER TO ACCEPT
AND REPLY TO MAPREDUCE TASKS, AND
ALSO TO DATA NODE TO STORE NEEDED
BLOCKS AS AVAILABLE AS POSSIBLE
CENTRAL CONTROL NODE RUNS NAMENODE
TO KEEP TRACK OF HDFS DIRECTORIES &
FILES, AND JOBTRACKER TO DISPATCH
COMPUTE TASKS TO TASKTRACKER
HADOOP IS WRITTEN IN JAVA; IT ALSO SUPPORTS PYTHON AND RUBY, AS WELL AS OTHER ENGINES LIKE SPARK AND MORE EFFICIENT LANGUAGES LIKE SCALA
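Since Hadoop supports Python through its Streaming interface, a minimal word-count sketch gives a feel for the MapReduce model. This is a toy: in a real Streaming job the map phase would read stdin in a `mapper.py` and the reduce phase would read the shuffled, key-sorted stdin in a `reducer.py`; the function names and sample lines here are illustrative only.

```python
def map_phase(lines):
    """Emit (word, 1) pairs, the way a Streaming mapper writes 'word\t1' lines."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Sum counts per key; the framework delivers pairs grouped by key."""
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

lines = ["Hadoop runs MapReduce", "Spark runs on Hadoop"]
counts = reduce_phase(map_phase(lines))
print(counts)  # 'hadoop' and 'runs' each appear twice
```

The same two functions, split into two scripts reading stdin, could be submitted with the `hadoop jar hadoop-streaming.jar` launcher.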
10.
11. HADOOP DISTRIBUTED FILESYSTEM
(HDFS) ARCHITECTURE
TAILORED TO THE NEEDS OF MAPREDUCE
TARGETED TOWARDS MANY READS OF
FILESTREAMS
WRITES ARE MORE COSTLY – TIME, EFFORT –
SO WRITE ONCE – READ MANY PREFERRED
HIGH DEGREE OF DATA REPLICATION (3X BY
DEFAULT)
LARGE BLOCKSIZE (128 MB)
LOCATION AWARENESS OF DATA NODES IN THE NETWORK (RACK- AND GEOGRAPHY-AWARE STORAGE)
Cluster of machines running
Hadoop at Yahoo! (Source: Yahoo!)
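The default 128 MB block size and 3x replication have a direct storage cost, which a short sketch makes concrete (the helper name is ours, not an HDFS API):

```python
import math

BLOCK_SIZE = 128 * 1024**2   # HDFS default block size: 128 MB
REPLICATION = 3              # HDFS default replication factor

def hdfs_footprint(file_size_bytes):
    """Return (block count, total block replicas, raw bytes stored) for one file."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    replicas = blocks * REPLICATION
    raw_bytes = file_size_bytes * REPLICATION  # the last block holds only actual data
    return blocks, replicas, raw_bytes

one_gb = 1024**3
blocks, replicas, raw = hdfs_footprint(one_gb)
print(blocks, replicas, raw)  # a 1 GB file -> 8 blocks, 24 replicas, ~3 GB raw
```

The large block size keeps per-block metadata on the NameNode small and favors long sequential reads, which suits the write-once, read-many pattern above.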
12. ARCHITECTURE - NAMENODE
STORES METADATA FOR THE FILES, LIKE THE
DIRECTORY STRUCTURE OF A TYPICAL FS
THE SERVER HOLDING THE NAMENODE INSTANCE IS QUITE CRUCIAL, AS THERE IS ONLY ONE; A SECONDARY (BACKUP) NAMENODE ASSISTS IT
KEEPS A TRANSACTION LOG FOR FILE DELETES/ADDS, ETC. TRANSACTIONS COVER ONLY METADATA, NOT WHOLE BLOCKS OR FILE STREAMS
HANDLES CREATION OF MORE REPLICA
BLOCKS WHEN NECESSARY AFTER A DATA
NODE FAILURE
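The metadata-only role of the NameNode can be sketched as a tiny in-memory namespace plus an edit log; this toy class and its method names are ours, not Hadoop's Java API, and it omits the replica-repair logic entirely.

```python
class NameNode:
    """Toy NameNode: holds only metadata (file path -> block ids) plus an
    edit log of namespace changes; actual block bytes live on DataNodes."""

    def __init__(self):
        self.files = {}      # path -> list of block ids
        self.edit_log = []   # transaction log of metadata operations

    def create(self, path, blocks):
        self.files[path] = list(blocks)
        self.edit_log.append(("add", path))

    def delete(self, path):
        self.files.pop(path, None)
        self.edit_log.append(("delete", path))

nn = NameNode()
nn.create("/logs/day1", ["blk_1", "blk_2"])
nn.delete("/logs/day1")
print(nn.edit_log)
```

Note the log records only the namespace operations; no block contents pass through, which is why a single NameNode can serve a very large cluster.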
13. ARCHITECTURE – DATANODE
STORES THE ACTUAL DATA IN HDFS
CAN RUN ON ANY UNDERLYING
FILESYSTEM (EXT 3/4, NTFS, ETC.)
NOTIFIES NAMENODE OF WHAT BLOCKS
IT HAS
NAMENODE REPLICATES BLOCKS 2X IN
LOCAL RACK, 1X ELSEWHERE
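The placement policy described above (two replicas on the writer's rack, one on a different rack) can be sketched as a few lines of Python; the cluster layout and function name here are hypothetical, not HDFS's actual `BlockPlacementPolicy` code.

```python
def place_replicas(writer_node, nodes_by_rack):
    """Sketch of rack-aware placement: two replicas on the writer's local
    rack, one replica on a different rack. nodes_by_rack maps a rack id
    to a list of node names."""
    local_rack = next(r for r, ns in nodes_by_rack.items() if writer_node in ns)
    # two replicas on the local rack, starting with the writer itself
    local = [writer_node] + [n for n in nodes_by_rack[local_rack] if n != writer_node][:1]
    # one replica on the first node of any other rack
    remote = next(n for r, ns in nodes_by_rack.items() if r != local_rack for n in ns[:1])
    return local + [remote]

racks = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
placement = place_replicas("n1", racks)
print(placement)
```

Keeping two replicas rack-local saves cross-rack bandwidth on writes, while the off-rack copy survives the loss of an entire rack.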
14. ARCHITECTURE – JOBTRACKER AND TASKTRACKER
THE JOBTRACKER MAKES SURE THAT EACH OPERATION IS COMPLETED; IF THERE IS A PROCESS FAILURE AT ANY NODE, IT ASSIGNS A DUPLICATE TASK TO ANOTHER TASKTRACKER. THE JOBTRACKER ALSO DISTRIBUTES THE ENTIRE JOB ACROSS ALL THE MACHINES.
THE TASKTRACKERS (THE PROJECT MANAGERS IN OUR ANALOGY) ON DIFFERENT MACHINES ARE COORDINATED BY A SINGLE JOBTRACKER
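A toy dispatch loop shows the re-assignment behavior described above; the tracker names and the `fails_on` parameter are invented for illustration and bear no relation to the real JobTracker RPC protocol.

```python
def run_job(tasks, trackers, fails_on=None):
    """Toy JobTracker loop: dispatch each task to a TaskTracker round-robin;
    if the chosen tracker has failed, assign a duplicate of the task to a
    healthy tracker instead."""
    fails_on = fails_on or set()
    completed = {}
    for i, task in enumerate(tasks):
        tracker = trackers[i % len(trackers)]
        if tracker in fails_on:
            # process failure: duplicate the task onto another tracker
            tracker = next(t for t in trackers if t not in fails_on)
        completed[task] = tracker
    return completed

done = run_job(["map0", "map1", "map2"], ["tt1", "tt2"], fails_on={"tt2"})
print(done)  # every task lands on the surviving tracker tt1
```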
15. ARCHITECTURE – YARN (YET ANOTHER
RESOURCE NEGOTIATOR)
YARN ARCHITECTURE CAN BE A LITTLE CONFUSING…
HADOOP 2.0 INTRODUCED YARN (YET ANOTHER RESOURCE NEGOTIATOR) AS HADOOP MOVED FROM MAPREDUCE TO A MORE GENERIC MODEL, WITH THE ABILITY TO SUPPORT APACHE SPARK AND OTHER REAL-TIME ENGINES.
IT'S BASICALLY LIKE MULTITHREADING – MORE INSTANCES OF AN APPLICATION MANAGED BY A MASTER-MANAGER.
EXPAND THIS IDEA TO A CLUSTER: A NUMBER OF APPLICATIONS MAY BE SPAWNED, EACH WITH A CORRESPONDING APPLICATION MASTER. TASKS, OR WORKERS, ARE RUN AND MANAGED BY THE APPLICATION MASTER. THE APPLICATION MASTER REQUESTS THE RESOURCE MANAGER, WHICH ALLOCATES RESOURCES
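The ApplicationMaster/ResourceManager negotiation can be sketched as two toy classes; the class and method names are ours (the real YARN protocol runs over RPC with containers, priorities, and locality hints that this sketch omits).

```python
class ResourceManager:
    """Toy YARN ResourceManager tracking free memory per NodeManager."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)   # node name -> free MB

    def allocate(self, mem_mb):
        """Grant a container on any node with enough free memory, else None."""
        for node, free in self.free.items():
            if free >= mem_mb:
                self.free[node] -= mem_mb
                return node
        return None

class ApplicationMaster:
    """Toy ApplicationMaster: asks the RM for one container per task."""

    def __init__(self, rm):
        self.rm = rm

    def launch(self, tasks, mem_per_task_mb):
        return {t: self.rm.allocate(mem_per_task_mb) for t in tasks}

rm = ResourceManager({"nm1": 2048, "nm2": 1024})
am = ApplicationMaster(rm)
placements = am.launch(["t1", "t2", "t3"], 1024)
print(placements)  # tasks spread across nm1, nm1, nm2
```

The key point the slide makes survives even in this toy: the per-application master owns its tasks, while a single ResourceManager owns the cluster-wide resource ledger.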
16. TECHNOLOGIES ON HADOOP
ECOSYSTEM – WHERE ALL TOOLS RESIDE TOGETHER, LIKE A POND ECOSYSTEM
DATA PONDS, DATA LAKES AND DATA RESERVOIRS -
WHICH ARE REPLACING TRADITIONAL DATA
WAREHOUSES
EFFICIENT BUSINESS INTELLIGENCES BY PREDICTION
AND FORECASTING
ALGORITHMS FOR MACHINE LEARNING AND DEEP
LEARNING
WEB NOTEBOOKS, E.G. APACHE ZEPPELIN
DATABASES AND SQL – NOSQL (NON-RELATIONAL) DATABASES LIKE CASSANDRA AND HBASE, AND SQL-ON-HADOOP ENGINES LIKE HIVEQL AND SPARK SQL
17. TECHNOLOGIES ON HADOOP
OPEN APIS FOR OPERATING ON DOCUMENTS – OPEN
JSON
STREAM PROCESSING – DATA STREAMING – SPARK
STREAMING, APACHE STORM, REAL-TIME, EVENT
BASED – EX: FACEBOOK LIVE, REAL TIME DATA
STREAMING FOR DATA LAKES
MESSAGING PLATFORMS – APACHE KAFKA – USED BY
LINKEDIN FOR MESSAGING, ANALYTICS, WITHOUT
HAVING TO PERFORM ANY KIND OF DATA MOVEMENT
EX: GROUPME, FACEBOOK MESSENGER
GLOBAL RESOURCE MANAGEMENT – THE ABILITY TO PRIORITIZE THE RESOURCES (CPU, MEMORY, BANDWIDTH) OF AN APPLICATION. BUSINESSES CAN GREATLY INCREASE THEIR MOMENTUM WHEN THEY ARE ABLE TO USE THEIR ASSETS FOR CRITICAL PROJECTS
18. DATA PREPARATION, WRANGLING, AND ANALYSIS ON HADOOP
VARIOUS ALGORITHMS FOR
METADATA EXTRACTION
FORMAT CONVERSION
MDM IDENTIFICATION
CROSS LINKING AMONG VARIOUS DATA
CENTRALIZED INDEXING, TAGS, BUSINESS
METADATA, TECHNICAL METADATA
TEXTUAL PATTERN RECOGNITION
MOST OF THESE TOOLS ARE
SELF SERVICE ONES
DATA INTEGRATION
19. BUSINESS INTELLIGENCE ON HADOOP
SEARCH ENGINE TOOLS FOR DIGGING INTO OR MINING OFFICE DATA, WITH RANKED RESULTS AND SUGGESTIONS. EXAMPLE – ELASTICSEARCH
CUBING TOOLS – PREPARE DATA, COMPUTE
COMPLEX CALCULATIONS AND KEEP FOR
CONSUMPTION/REPORTING. EX: ATSCALE,
TRIFACTA
STATISTICAL TOOLS – JMP AND SAS
GEOSPATIAL TOOLS AND ACCESSORIES – EX: ESRI SPATIAL FRAMEWORK FOR HADOOP
TARGET MARKETING – EX: ELECTION
SOLICITING TO TARGET AUDIENCE OVER
SOCIAL MEDIA
DECENTRALIZED ANALYTICS – ANALYSIS DIVIDED ACROSS MULTIPLE LOCATIONS AND MULTIPLE TALENTS, THEN CONVERGED INTO GOOD RESULTS