This document provides an introduction to big data concepts. It discusses the limitations of traditional architectures when data grows by petabytes each week and arrives from diverse sources. Apache Hadoop is presented as a solution: a clustered architecture that is scalable, flexible, and cost-efficient. Key aspects of Hadoop covered include its use of commodity hardware, storage of data across clusters of nodes, and benchmarks for sorting large datasets efficiently.
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |... - Edureka!
This Edureka "What is Hadoop" Tutorial (check our hadoop blog series here: https://goo.gl/lQKjL8) will help you understand all the basics of Hadoop. Learn about the differences in traditional and hadoop way of storing and processing data in detail. Below are the topics covered in this tutorial:
1) Traditional Way of Processing - SEARS
2) Big Data Growth Drivers
3) Problem Associated with Big Data
4) Hadoop: Solution to Big Data Problem
5) What is Hadoop?
6) HDFS
7) MapReduce
8) Hadoop Ecosystem
9) Demo: Hadoop Case Study - Orbitz
Subscribe to our channel to get updates.
Check our complete Hadoop playlist here: https://goo.gl/4OyoTW
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ... - Simplilearn
This video on Hadoop interview questions part-1 will take you through the general Hadoop questions and questions on HDFS, MapReduce and YARN, which are very likely to be asked in any Hadoop interview. It covers all the topics on the major components of Hadoop. This Hadoop tutorial will give you an idea about the different scenario-based questions you could face and some multiple-choice questions as well. Now, let us dive into this Hadoop interview questions video and gear up for your next Hadoop interview.
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
This document provides an introduction and overview of HDFS and MapReduce in Hadoop. It describes HDFS as a distributed file system that stores large datasets across commodity servers. It also explains that MapReduce is a framework for processing large datasets in parallel by distributing work across clusters. The document gives examples of how HDFS stores data in blocks across data nodes and how MapReduce utilizes mappers and reducers to analyze datasets.
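As a concrete illustration of the mapper and reducer roles just described, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the class names and the word-count task itself are illustrative assumptions, not code taken from the document being summarized.

```java
// WordCount.java - illustrative mapper and reducer for a classic word-count job.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: each call receives one line of an input split (stored as HDFS blocks)
// and emits an intermediate (word, 1) pair for every token in the line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: receives all intermediate values emitted for the same word
// (grouped and shuffled by the framework) and sums them into a final count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}
```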
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations - William Yetman
This was one of my first presentations on Big Data at Ancestry.com. The audience was split between Family Historians interested in the Technology and Developers interested in our Big Data Story. So the presentation is a mix. I think there is plenty for someone with an interest in technology and enough meat for a "technologist".
Keep this in mind as you look at this presentation.
Thanks,
-Bill-
The document outlines the steps to set up a Hadoop cluster and run a MapReduce job across the cluster. It describes cloning Hadoop from the master node to two slave nodes, configuring settings like the hosts file and SSH keys for access. The document then details formatting the HDFS, starting services on all nodes, importing data and running a sample MapReduce word count job on the cluster. Finally, it discusses stopping the Hadoop services on all nodes to shut down the cluster.
Introduction and Overview of BigData, Hadoop, Distributed Computing - BigData... - Mahantesh Angadi
This document provides an introduction to big data and the installation of a single-node Apache Hadoop cluster. It defines key terms like big data, Hadoop, and MapReduce. It discusses traditional approaches to handling big data like storage area networks and their limitations. It then introduces Hadoop as an open-source framework for storing and processing vast amounts of data in a distributed fashion using the Hadoop Distributed File System (HDFS) and MapReduce programming model. The document outlines Hadoop's architecture and components, provides an example of how MapReduce works, and discusses advantages and limitations of the Hadoop framework.
Detailed presentation on big data hadoop +Hadoop Project Near Duplicate Detec... - Ashok Royal
Big Data Hadoop, its components, and a Hadoop project are described in detail.
Visit http://hadoop-beginners.blogspot.com to see Hadoop Tutorials.
Thanks for the visit. :)
This presentation provides an overview of Hadoop, including:
- A brief history of data and the rise of big data from various sources.
- An introduction to Hadoop as an open source framework used for distributed processing and storage of large datasets across clusters of computers.
- Descriptions of the key components of Hadoop - HDFS for storage, and MapReduce for processing - and how they work together in the Hadoop architecture.
- An explanation of how Hadoop can be installed and configured in standalone, pseudo-distributed and fully distributed modes.
- Examples of major companies that use Hadoop like Amazon, Facebook, Google and Yahoo to handle their large-scale data and analytics needs.
Hadoop Interview Questions and Answers | Big Data Interview Questions | Hadoo... - Edureka!
This Hadoop Tutorial on Hadoop Interview Questions and Answers ( Hadoop Interview Blog series: https://goo.gl/ndqlss ) will help you to prepare yourself for Big Data and Hadoop interviews. Learn about the most important Hadoop interview questions and answers and know what will set you apart in the interview process. Below are the topics covered in this Hadoop Interview Questions and Answers Tutorial:
Hadoop Interview Questions on:
1) Big Data & Hadoop
2) HDFS
3) MapReduce
4) Apache Hive
5) Apache Pig
6) Apache HBase and Sqoop
Check our complete Hadoop playlist here: https://goo.gl/4OyoTW
#HadoopInterviewQuestions #BigDataInterviewQuestions #HadoopInterview
2014 feb 24_big_datacongress_hadoopsession1_hadoop101 - Adam Muise
This document provides an introduction to Hadoop using the Hortonworks Sandbox virtual machine. It discusses how Hadoop was created to address the limitations of traditional data architectures for handling large datasets. It then describes the key components of Hadoop like HDFS, MapReduce, YARN and Hadoop distributions like Hortonworks Data Platform. The document concludes by explaining how to get started with the Hortonworks Sandbox VM which contains a single node Hadoop cluster within a virtual machine, avoiding the need to install Hadoop locally.
Hadoop Hands-on Lab: Installing Hadoop 2 - IMC Institute
This document is the agenda for a hands-on workshop on Big Data using Hadoop. It includes an introduction to Big Data concepts, the Hadoop ecosystem, and instructions for installing Hadoop on an Amazon EC2 virtual server in pseudo-distributed mode. The workshop agenda covers launching an EC2 instance, installing Java, downloading and extracting Hadoop, configuring Hadoop, formatting the namenode, and starting the Hadoop processes.
Big Data Hadoop using Amazon Elastic MapReduce: Hands-On Labs - IMC Institute
This document provides an overview of a hands-on workshop on running Hadoop on Amazon Elastic MapReduce. It discusses setting up an AWS account, signing up for necessary services like S3 and EC2, creating an S3 bucket, generating access keys, creating a new EMR job flow, and viewing results from the S3 bucket. It also covers installing and running Hadoop locally, importing and reviewing data in HDFS, and the MapReduce programming model.
10 Popular Hadoop Technical Interview Questions - ZaranTech LLC
The document discusses 10 common questions that may be asked in a Hadoop technical interview. It provides definitions for big data and the four V's of big data (volume, variety, veracity, velocity). It also discusses how businesses use big data analytics to increase revenue, examples of companies that use Hadoop, the difference between structured and unstructured data, the concepts that Hadoop works on (HDFS and MapReduce), core Hadoop components, hardware requirements for running Hadoop, common input formats, and some common Hadoop tools. Overall, the document outlines essential information about big data and Hadoop that may be helpful to review for a technical interview.
This document provides an overview of 4 solutions for processing big data using Hadoop and compares them. Solution 1 involves using core Hadoop processing without data staging or movement. Solution 2 uses BI tools to analyze Hadoop data after a single CSV transformation. Solution 3 creates a data warehouse in Hadoop after a single transformation. Solution 4 implements a traditional data warehouse. The solutions are then compared based on benefits like cloud readiness, parallel processing, and investment required. The document also includes steps for installing a Hadoop cluster and running sample MapReduce jobs and Excel processing.
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
This document provides an overview of a masterclass on big data presented by Prof.dr.ir. Arjen P. de Vries. It discusses defining properties of big data, challenges in big data analytics including capturing, aligning, transforming, modeling and understanding large datasets. It also briefly introduces map-reduce and streaming data analysis. Examples of large datasets that could be analyzed are provided, such as the sizes of datasets from Facebook, Google and other organizations.
This document discusses big data and Hadoop. It notes that traditional technologies are not well-suited to handle the volume of data generated today. Hadoop was created by companies like Google and Yahoo to address this challenge through its distributed file system HDFS and processing framework MapReduce. The document promotes Hadoop and the Hortonworks Data Platform for storing, processing, and analyzing large volumes of diverse data in a cost-effective manner.
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce - Mahantesh Angadi
The document summarizes a technical seminar presentation on scheduling methods in the Hadoop MapReduce framework. The presentation covers the motivation for Hadoop and MapReduce, provides an introduction to big data and Hadoop, and describes HDFS and the MapReduce programming model. It then discusses challenges in MapReduce scheduling and surveys the literature on existing scheduling methods. The presentation surveys five papers on proposed MapReduce scheduling methods, summarizing the key points of each. It concludes that improving data locality can enhance performance and that future work could consider scheduling algorithms for heterogeneous clusters.
Hadoop Administration Training | Hadoop Administration Tutorial | Hadoop Admi... - Edureka!
This Edureka Hadoop Administration Training tutorial will help you understand the functions of all the Hadoop daemons and the configuration parameters involved with them. It will also take you through a step-by-step multi-node Hadoop installation and will discuss all the configuration files in detail. Below are the topics covered in this tutorial:
1) What is Big Data?
2) Hadoop Ecosystem
3) Hadoop Core Components: HDFS & YARN
4) Hadoop Core Configuration Files
5) Multi Node Hadoop Installation
6) Tuning Hadoop using Configuration Files
7) Commissioning and Decommissioning the DataNode
8) Hadoop Web UI Components
9) Hadoop Job Responsibilities
This document provides an overview of installing and configuring Apache Hadoop. It begins with background on big data and Hadoop, including definitions of big data, the Hadoop ecosystem, and differences between Hadoop 1.0 and 2.0. It then discusses installing Hadoop, describing the steps to set up a Cloudera cluster on Amazon Web Services and requirements for installing Cloudera Manager. The document concludes with mentioning a lab to set up a Cloudera cluster on AWS.
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop... - Simplilearn
This presentation about Hadoop will help you understand what Big Data is, what Hadoop is, how Hadoop came into existence, the various components of Hadoop, and a Hadoop use case. A massive amount of data is generated every day, and it cannot be stored, processed, and analyzed using traditional approaches. That is why Hadoop came into existence as a solution for Big Data. Hadoop is a framework that manages Big Data storage in a distributed way and processes it in parallel. Now, let us get started and understand the importance of Hadoop and why we actually need it.
Below topics are explained in this Hadoop presentation:
1. The rise of Big Data
2. What is Big Data?
3. Big Data and its challenges
4. Hadoop as a solution
5. What is Hadoop?
6. Components of Hadoop
7. Use case of Hadoop
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses a simple programming model called MapReduce that automatically parallelizes and distributes work across nodes. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and MapReduce execution engine for processing. HDFS stores data as blocks replicated across nodes for fault tolerance. MapReduce jobs are split into map and reduce tasks that process key-value pairs in parallel. Hadoop is well-suited for large-scale data analytics as it scales to petabytes of data and thousands of machines with commodity hardware.
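To make the storage side of this description concrete, the sketch below writes a small file into HDFS through Hadoop's Java FileSystem API and requests an explicit replication factor and block size; the path, the 3-way replication, and the 128 MB block size are illustrative assumptions rather than values taken from the document.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath; fs.defaultFS
        // would normally point at the NameNode, e.g. hdfs://namenode:8020.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");  // hypothetical HDFS path

        // create(path, overwrite, bufferSize, replication, blockSize):
        // ask for 3 replicas per block and 128 MB blocks (common defaults).
        try (OutputStream out =
                fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // The NameNode records which DataNodes hold each replica of each block.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication=" + status.getReplication()
                + " blockSize=" + status.getBlockSize());
    }
}
```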
Big Data Programming Using Hadoop Workshop - IMC Institute
The document describes instructions for running a hands-on workshop on Hadoop and MapReduce. It includes steps for starting a Cloudera VM, importing and exporting data from HDFS, and writing a basic word count MapReduce program. The MapReduce process involves a map step that processes input records and outputs key-value pairs, and a reduce step that combines all intermediate values associated with the same key.
Introduction to MapReduce | MapReduce Architecture | MapReduce Fundamentals - Skillspeed
This Hadoop MapReduce tutorial will unravel MapReduce Programming, MapReduce Commands, MapReduce Fundamentals, Driver Class, Mapper Class, Reducer Class, Job Tracker & Task Tracker.
At the end, you'll have a strong grasp of Hadoop MapReduce basics.
PPT Agenda:
✓ Introduction to BIG Data & Hadoop
✓ What is MapReduce?
✓ MapReduce Data Flows
✓ MapReduce Programming
----------
What is MapReduce?
MapReduce is a programming framework for distributed processing of large datasets on commodity computing clusters. It is based on the principle of parallel data processing, wherein data is broken into smaller blocks rather than processed as a single block. This makes for a faster, more scalable solution. MapReduce programs are typically written in Java.
----------
What are MapReduce Components?
It has the following components (a minimal driver sketch that wires them together follows this list):
1. Combiner: The combiner collates all the data from the sample set based on your desired filters. For example, you can collate data based on day, week, month and year. After this, the data is prepared and sent for parallel processing.
2. Job Tracker: This allocates the data across multiple servers.
3. Task Tracker: This executes the program across various servers.
4. Reducer: It will isolate the desired output from across the multiple servers.
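As referenced above, here is a minimal driver sketch that wires these pieces into a runnable job, reusing the WordCountMapper and WordCountReducer classes from the earlier word-count example; registering the reducer as the combiner is a common word-count convention assumed here, not something prescribed by this tutorial. In classic MRv1 the Job Tracker and Task Trackers schedule and execute the resulting map and reduce tasks (YARN takes over that role in Hadoop 2.x).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver class: describes the job that the cluster (JobTracker/TaskTrackers in
// MRv1, YARN in MRv2) schedules as map and reduce tasks across the nodes.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // map phase
        job.setCombinerClass(WordCountReducer.class);  // optional local pre-aggregation
        job.setReducerClass(WordCountReducer.class);   // reduce phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job built this way is typically packaged into a jar and submitted with a command along the lines of hadoop jar wordcount.jar WordCountDriver /input /output.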
----------
Applications of MapReduce
1. Data Mining
2. Document Indexing
3. Business Intelligence
4. Predictive Modelling
5. Hypothesis Testing
----------
Skillspeed is a live e-learning company focusing on high-technology courses. We provide live instructor-led training in BIG Data & Hadoop featuring Realtime Projects, 24/7 Lifetime Support & 100% Placement Assistance.
Email: sales@skillspeed.com
Website: https://www.skillspeed.com
The document is a presentation on big data and Hadoop. It introduces the speaker, Adam Muise, and discusses the challenges of dealing with large and diverse datasets. Traditional approaches of separating data into silos are no longer sufficient. The presentation argues that a distributed system like Hadoop is needed to bring all data together and enable it to be analyzed as a whole.
Hadoop at the Center: The Next Generation of Hadoop - Adam Muise
This document discusses Hortonworks' approach to addressing challenges around managing large volumes of diverse data. It presents the Hortonworks Data Platform (HDP) as a solution for consolidating siloed data into a central data lake on a single cluster. This allows different data types and workloads like batch, interactive, and real-time processing to leverage shared services for security, governance and operations while preserving existing tools. The HDP also enables new use cases for analytics like real-time personalization and segmentation using diverse data sources.
This document discusses big data and Hadoop. It defines big data as large amounts of unstructured data that would be too costly to store and analyze in a traditional database. It then describes how Hadoop provides a solution to this challenge through distributed and parallel processing across clusters of commodity hardware. Key aspects of Hadoop covered include HDFS for reliable storage, MapReduce for distributed computing, and how together they allow scalable analysis of very large datasets. Popular users of Hadoop like Amazon, Yahoo and Facebook are also mentioned.
This document defines key terms related to big data such as structured data, unstructured data, and semi-structured data. It discusses how data is generated from various sources and factors like sensors, social networks, and online shopping. It explains that big data refers to data that is too large to process using traditional methods due to its volume, velocity, and variety. Hadoop is introduced as an open source framework that uses HDFS for distributed storage and MapReduce for distributed processing of large data sets across computer clusters.
This document provides an overview of Hadoop, a tool for processing large datasets across clusters of computers. It discusses why big data has become so large, including exponential growth in data from the internet and machines. It describes how Hadoop uses HDFS for reliable storage across nodes and MapReduce for parallel processing. The document traces the history of Hadoop from its origins in Google's file system GFS and MapReduce framework. It provides brief explanations of how HDFS and MapReduce work at a high level.
Big data refers to large volumes of data that are diverse in type and are produced rapidly. It is characterized by the V's: volume, velocity, variety, veracity, and value. Hadoop is an open-source software framework for distributed storage and processing of big data across clusters of commodity servers. It has two main components: HDFS for storage and MapReduce for processing. Hadoop allows for the distributed processing of large data sets across clusters in a reliable, fault-tolerant manner. The Hadoop ecosystem includes additional tools like HBase, Hive, Pig and Zookeeper that help access and manage data. Understanding Hadoop is a valuable skill as many companies now rely on big data and Hadoop technologies.
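As a small illustration of how ecosystem tools such as Hive expose Hadoop data, the sketch below runs a SQL query against HiveServer2 over JDBC; the connection URL, credentials, and the words table are hypothetical, and the org.apache.hive:hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, database and table are hypothetical.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "demo", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")) {
            while (rs.next()) {
                // Hive compiles this query down to jobs on the cluster and
                // streams the aggregated results back over JDBC.
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```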
The document provides an introduction to big data and Hadoop. It defines big data as large datasets that cannot be processed using traditional computing techniques due to the volume, variety, velocity, and other characteristics of the data. It discusses traditional data processing versus big data and introduces Hadoop as an open-source framework for storing, processing, and analyzing large datasets in a distributed environment. The document outlines the key components of Hadoop including HDFS, MapReduce, YARN, and Hadoop distributions from vendors like Cloudera and Hortonworks.
This Big Data Analytics & Trends presentation discusses what big data is, why it is important, definitions of big data, the data types and landscape, and characteristics of big data such as volume, velocity, and variety. It covers data generation points, big data analytics, example scenarios, challenges of big data such as storage and processing speed, and Hadoop as a framework to solve these challenges. The presentation also differentiates between big data and data science, discusses salary trends in Hadoop/big data, and looks at the future growth of the big data market.
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014 - Josh Patterson
Josh Patterson is a principal solution architect who has worked with Hadoop at Cloudera and Tennessee Valley Authority. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. It allows for consolidating mixed data types at low cost while keeping raw data always available. Hadoop uses commodity hardware and scales to petabytes without changes. Its distributed file system provides fault tolerance and replication while its processing engine handles all data types and scales processing.
This presentation simplifies the concepts of big data, NoSQL databases, and Hadoop components.
The Original Source:
http://zohararad.github.io/presentations/big-data-introduction/
Hadoop and the Data Warehouse: Point/Counter Point - Inside Analysis
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
Learn About Big Data and Hadoop The Most Significant Resource - Assignment Help
Because of the digital revolution, data is now one of the most significant resources for businesses all around the world. The ability to gather, organize, process, and evaluate huge volumes of data has altered the way businesses function and arrive at informed decisions. Managing and gleaning insight from these ever-expanding oceans of information is impossible without Big Data and Hadoop, both of which are at the vanguard of this data revolution.
If you have chosen a programming language and are having difficulty writing a strong assignment, assessment help experts can assist you in learning more about it. In this blog, we will look at the basics of Big Data and Hadoop and how they work. We will also explore the nature of Big Data, its defining features, and the difficulties it presents, and take a look at how Hadoop, an open-source platform, has become a frontrunner in the race to solve the challenges posed by Big Data. To fully appreciate the transformative potential of Big Data and Hadoop for businesses across a wide range of sectors, it is necessary first to grasp the central role they play in today's data-driven decision-making.
Apache Big Data Europa - How to make money with your own data - Jorge Lopez-Malla
This document discusses how Stratio used big data technologies like Apache Spark to help a middle eastern telecommunications company with data challenges. It describes Stratio as the first Spark-based big data platform and discusses how they helped the telco process over 9.5 million daily events from 9.2 million customers. Specifically, Stratio used Spark and its machine learning library MLLib to build models from millions of data points to recognize patterns and improve network coverage, gather customer insights, and monetize data.
Yahoo is the largest corporate contributor, tester, and user of Hadoop. They have 4000+ node clusters and contribute all their Hadoop development work back to Apache as open source. They use Hadoop for large-scale data processing and analytics across petabytes of data to power services like search and ads optimization. Some challenges of using Hadoop at Yahoo's scale include unpredictable user behavior, distributed systems issues, and the difficulties of collaboration in open source projects.
Alexander Aldev - Co-founder and CTO of MammothDB, currently focused on the architecture of the distributed database engine. Notable achievements in the past include managing the launch of the first triple-play cable service in Bulgaria and designing the architecture and interfaces from legacy systems of DHL Global Forwarding's data warehouse. Has lectured on Hadoop at AUBG and MTel.
"The future of Big Data tooling" will briefly review the architectural concepts of current Big Data tools like Hadoop and Spark. It will make the argument, from the perspective of both technology and economics, that the future of Big Data tools is in optimizing local storage and compute efficiency.
1. The document discusses the future of data science and big data technologies. It describes the roles of data scientists and their typical skills, salaries, and job outlook.
2. It discusses technologies like Hadoop, Spark, and distributed computing that are used to handle big data. While Hadoop is good for batch processing, Spark can perform both batch and real-time processing 100x faster.
3. Going forward, data science will shift from descriptive to predictive analytics using machine learning to improve customer experience and business outcomes across industries like internet search and digital advertising.
The need to process huge volumes of data is increasing day by day, and doing so involves compute, network, and storage. In terms of Big Data, what does it take to innovate, and what does innovation ultimately look like? This talk provides high-level details on the need for big data and the capabilities of the MapR Converged Data Platform.
Speaker: Vijaya Saradhi Uppaluri, Technical Director at MapR Technologies
Big Data with IOT approach and trends with case study - Sharjeel Imtiaz
Big data and IoT technologies are increasingly being used together for new applications. The document discusses using big data and IoT for tourism recommendations in Oman. It outlines a case study approach involving collecting hotel review data from TripAdvisor, analyzing the data using sentiment analysis and topic modeling, and developing a recommendation system. The system would integrate IoT devices in hotel rooms to gather additional guest feedback and preferences on amenities like lighting, music, and more. This combined big data and IoT approach aims to provide more personalized recommendations to improve the Omani tourism experience.
Hadoop is an open-source framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop supports the processing of structured, unstructured and semi-structured data and is able to reliably store and process petabytes of data. Some key applications of Hadoop include web search indexing, data mining, machine learning, scientific data analysis, and business intelligence.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data - Kiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working with unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
Global Situational Awareness of A.I. and where it's headed - vikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Learn SQL from basic queries to Advanced queries - manishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
End-to-end pipeline agility - Berlin Buzzwords 2024 - Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and the worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder
1. On Which side of the Cloud are you ?
An Introduction to Big Data
Denis Rothman
Copyright 2014 Denis Rothman
2. Big Data - Introduction
□ This course is not meant to make Big
Data experts out of you in a few
hours but is designed to help you
grasp the main concepts.
□ We’ll be discussing Apache Hadoop,
MapReduce, MongoDB, Pig and
several other names and concepts
that will be familiar to you by the end
of the course !
Copyright 2014 Denis Rothman
3. Big Data - Introduction
□ We’re going to talk about Apache
« Hadoop » and « MapReduce »
because the following companies use
this technology, at least in parent or
derived versions : Google, Yahoo!,
Facebook, Amazon, IBM, eBay
and many more key players on the
market.
Copyright 2014 Denis Rothman
4. Big Data - Introduction
□ All the figures, software and brands
mentioned in this document are simple
examples. All of this is going to expand and
change through the years !
□ The main goal here is for you to grasp
enough concepts to be able to create Big
Data architectures with today’s but also
tomorrow’s technology and ideas !
□ So focus on the concepts and the way you
can solve problems with Big Data
technology.
Copyright 2014 Denis Rothman
5. Big Data – What is big data ?
Learn more : http://en.wikipedia.org/wiki/Big_data
Let’s say that starting at around 10 TB for a dataset (a collection of data) we’re
talking Big Data, and starting at one petabyte we really need the technology !
The world has jumped from talking petabytes to exabytes in a year; soon we’ll
probably be talking zettabytes.
1 EB = 1 000 000 000 000 000 000 B = 10^18 bytes = 1 000 petabytes = 1 million terabytes = 1 billion gigabytes.
Copyright 2014 Denis Rothman
6. Big Data – What is big data ?
For the Universe, the galaxies
are our small representative
volumes, and there are
something like 10^11 to
10^12 stars in our Galaxy
(The Milky Way)
• The number of bits on a
terabyte-capacity computer hard
disk is typically about 10^13
(1 000 GB).
To compare the amount of data we now store we have to go
down to atom-level quantities in our universe !
Copyright 2014 Denis Rothman
7. Big Data – Can you represent the
Volume ?
Learn more : http://www.seagate.com/about/newsroom/press-releases/Terascale-Enterprise-HDD-pr-master/
Tell us how and where you
would store a 1PB dataset for a
given company without Big
Data technology.
How many average-size 4 TB
hard disks would it take to
simply store the data ? (See the
quick sketch after this slide.)
High-Capacity— highest capacity HDD (4TB) available in a 3.5-
inch enterprise-class SATA(Serial Advanced Technology Attachment)
HDD enabling scalable, high-capacity storage in 24×7
environments.
?
Copyright 2014 Denis Rothman
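As a quick back-of-the-envelope check (a minimal sketch added here for clarity, not part of the original slide; it uses decimal units, 1 PB = 1 000 TB), the raw disk count is simple arithmetic:

# Back-of-the-envelope: how many 4 TB disks to hold 1 PB of raw data?
# (Ignores file-system overhead, RAID parity and so on.)
dataset_tb = 1000          # 1 PB expressed in terabytes
disk_tb = 4                # one enterprise SATA disk

disks_raw = dataset_tb / disk_tb
disks_replicated = disks_raw * 3   # Hadoop-style 3x replication, discussed later

print(disks_raw)           # 250.0 disks just to store the bytes once
print(disks_replicated)    # 750.0 disks with 3 copies of every block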
8. Big Data – Can you represent a fast
way to access(Velocity) 1PB of data
with Big Data technology?
Let’s say we’re talking
about the data related to
all bank accounts of the
BNP over the past 5 years
that had a balance of
more than $1 000 at a given
time and that need to be
accessed for a financial
analysis.
How would you do it now,
without Big Data
Technology ?
Copyright 2014 Denis Rothman
9. Big Data – Can you represent access to
additional documents in a great Variety of
data ?
Now we need to retrieve
other documents to
analyse these BNP
accounts : text
documents (signed
contracts, for example).
How would you do it now,
without Big Data
Technology ?
Copyright 2014 Denis Rothman
10. Big Data – Do you think you can manage
10PB without Big Data ?
If we now try to solve the 3 V problem with a 10PB dataset to
manage, how could we do it even with Oracle Big Files ?
A bigfile tablespace contains only
one datafile or tempfile, which
can contain up to approximately
4 billion (2^32) blocks. The
maximum size of the single
datafile or tempfile is 128
terabytes (TB) for a tablespace
with 32 K blocks and 32 TB for
a tablespace with 8 K blocks.
(From Oracle’s database limits table : « Bigfile Tablespaces – Number of blocks ».)
Learn more : http://docs.oracle.com/cd/B28359_01/server.111/b28320/limits002.htm#i2879
Copyright 2014 Denis Rothman
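To see where Oracle’s 128 TB and 32 TB figures come from, here is a quick check (a sketch added for clarity, not from the original deck) : the maximum datafile size is simply the block count times the block size.

# Maximum bigfile tablespace datafile size = number of blocks x block size
max_blocks = 2 ** 32                     # ~4 billion blocks per datafile

size_32k = max_blocks * 32 * 1024        # 32 KB blocks
size_8k = max_blocks * 8 * 1024          # 8 KB blocks

print(size_32k / 2 ** 40)   # 128.0 TB (binary terabytes)
print(size_8k / 2 ** 40)    # 32.0 TB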
11. Big Data – Volume, Velocity, Variety that is
beyond non Big Data solutions
We’ve seen the limits of non
Big Data technology.
How would you solve the
problem ?
Even if you already know
how Big Data works, do
you think it will solve the
increasing size and
variety of datasets ?
How will it help with
sensors ?
Copyright 2014 Denis Rothman
12. Big Data – Apache Hadoop
There are several
solutions on the
market. Let’s use
Apache Hadoop as a
way to understand
how Big Data storage
works to solve the 3V
problem.
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Copyright 2014 Denis Rothman
13. Big Data – Apache Hadoop
□ There are many ways
to try to understand a
subject. This part of
the course is designed
for you to see that
the core ideas of
Apache Hadoop are
simple !
Copyright 2014 Denis Rothman
14. Big Data – Apache Hadoop
□ First of all, what
does « Hadoop »
mean ? It means
nothing !
□ Doug Cutting just
named it after his
son’s toy elephant.
So that’s one
mystery solved.
Copyright 2014 Denis Rothman
15. Big Data – Apache Hadoop
□ The first thing
we need to do is
understand
cluster
architectures.
□ Cluster
architectures are
spreading at a
wild speed as a
framework for
the analysis of big
data.
New Exabytes of data appear
each…week…
Learn more : http://www.ovh.com/fr/serveurs_dedies/big-data/
16. Big Data – Apache Hadoop
□ Cluster architectures are the
best choice because they
offer Cloud performance :
extensible, flexible and
cost-efficient.
Copyright 2014 Denis Rothman
17. Big Data – Apache Hadoop
□ So what ? So
what’s the
difference between
a traditional
enterprise
architecture and a
cloud-cluster
architecture ?
Copyright 2014 Denis Rothman
18. Big Data – Apache Hadoop
□ A traditional
architecture is
built on
server technology
that is expensive
and thus has to be
used as much as
possible.
Copyright 2014 Denis Rothman
19. Big Data – Apache Hadoop
□ A traditional
architecture is
also built on
storage capacity
of different sizes
and types : SSD
to SATA.
Copyright 2014 Denis Rothman
20. Big Data – Apache Hadoop
□ A traditional
architecture is
finally built on
storage area
networks
(SAN) to
connect a set
of servers to a
set of storage
units
Copyright 2014 Denis Rothman
21. Big Data – Apache Hadoop
□ The big quality of traditional
architecture is that the servers and
storage units can be managed (size,
number) separately with SAN
(Storage Area Network) connecting
them.
Copyright 2014 Denis Rothman
22. Big Data – Apache Hadoop
□ The big drawback of traditional
architecture is that it must be
extremely reliable and any failure
must be dealt with very quickly.
□ This brings the price up.
Copyright 2014 Denis Rothman
23. Big Data – Apache Hadoop
□ Traditional architectures were
designed for intensive applications
focusing on one part of the data. The
servers process the information and
then the results are transferred to
storage.
Copyright 2014 Denis Rothman
24. Big Data – Apache Hadoop
□ So in essence a traditional architecture is
designed for a specific need (intense
computing, a standard data warehouse).
Fine.
□ How would you now solve a problem
involving a tremendous weekly increase in
data (PB) ? Not knowing what you’re
looking for in advance : sorting by order,
by timestamp or retrieving certain values.
Copyright 2014 Denis Rothman
25. Big Data – Apache Hadoop
□ Even a few years ago, Google was
facing an increase of data of
20 PB…per day.
□ For a special operation, let’s say user
mail history (number and size of
mails over a five year period), we
need to parse the entire dataset not
just a subset.
Copyright 2014 Denis Rothman
26. Big Data – Apache Hadoop
□ Why sort that data ?
□ To make searching, merging and
analyzing easier.
□ So how can you sort n x 20PB of
data?
□ With cluster architecture !
Copyright 2014 Denis Rothman
27. Big Data – Apache Hadoop
Let’s now study 3 classic sort
benchmarks used in cluster computing :
-Pennysort
-Minutesort
-Graysort
Copyright 2014 Denis Rothman
28. Big Data – Apache Hadoop
□ Sorting being a major function of Big
Data, it’s important to have
benchmark references.
Learn more : http://sortbenchmark.org/
GraySort
Metric: Sort rate (TBs / minute) achieved while sorting a very large
amount of data (currently 100 TB minimum).
PennySort
Metric: Amount of data that can be sorted for a penny's worth of system
time.
Originally defined in AlphaSort paper.
MinuteSort
Metric: Amount of data that can be sorted in 60.00 seconds or less.
Originally defined in AlphaSort paper.
Copyright 2014 Denis Rothman
29. Big Data – Apache Hadoop
Learn more : http://sortbenchmark.org/
2013, 1.42 TB/min
Hadoop
102.5 TB in 4,328 seconds
2100 nodes x
(2 x 2.3 GHz hexcore Xeon E5-2630, 64
GB memory, 12x3TB disks)
Thomas Graves
Yahoo! Inc.
Gray
Copyright 2014 Denis Rothman
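The 1.42 TB/min figure follows directly from the record’s own numbers; a quick sanity check (a sketch added here, not in the original slide) :

# GraySort 2013 Hadoop record: 102.5 TB sorted in 4,328 seconds
terabytes = 102.5
seconds = 4328

rate_tb_per_min = terabytes / (seconds / 60)
print(round(rate_tb_per_min, 2))   # ~1.42 TB per minute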
30. Big Data – Apache Hadoop
2011, 286 GB
psort
2.7 Ghz AMD Sempron, 4 GB RAM,
5x320 GB 7200 RPM Samsung SpinPoint F4
HD332GJ, Linux
Paolo Bertasi, Federica Bogo, Marco Bressan
and Enoch Peserico
Univ. Padova, Italy
Penny
Copyright 2014 Denis Rothman
31. Big Data – Apache Hadoop
2012, 1,401 GB
Flat Datacenter Storage
256 heterogeneous nodes, 1033 disks
Johnson Apacible, Rich Draves, Jeremy Elson,
Jinliang Fan, Owen Hofmann, Jon Howell, Ed
Nightingale, Reuben Olinksy, Yutaka Suzue
Microsoft Research
Minute
Copyright 2014 Denis Rothman
32. Big Data – Apache Hadoop
□ Getting down to a cluster.
A cluster breaks down to its basic
component : a NODE
A node is made up of cores, memory
and disks that can be assembled in
the thousands, the tens of thousands,
the hundreds of thousands.
Copyright 2014 Denis Rothman
33. Big Data – Apache Hadoop
□ The NODES
are then
grouped in
RACKS
□ The RACKS
are then
grouped into
CLUSTERS
The CLUSTERS ARE CONNECTED TO A NETWORK WITH A CISCO
SWITCH, for example
Copyright 2014 Denis Rothman
34. Big Data – Apache Hadoop
□ The first property of a cluster is to be
MODULAR and SCALABLE (handles
growing amount of elements)
□ This means that it’s cheap to just add
more and more nodes at the best
price and it doesn’t need to be that
reliable as we will see further.
Copyright 2014 Denis Rothman
35. Big Data – Apache Hadoop
□ The second property of a cluster is
DATA LOCALITY. This means you’re not
going through a sequence but directly
to the physical location. No more
bottlenecks...
□ This leads to PARALLELIZATION which
means you access several locations
simultaneously.
Learn more : http://en.wikipedia.org/wiki/Locality_of_reference
Copyright 2014 Denis Rothman
36. Big Data – Apache Hadoop
□ With data locality and parallelization
MASSIVE PARALLEL PROCESSING
becomes a reality.
□ The main function, sorting, can now
be done within each node on a subset
of data.
□ Please bear in mind that these nodes
are cheaper than traditional
architectures.
Copyright 2014 Denis Rothman
37. Big Data – Apache Hadoop
□ This is just an example that goes
back to 2011 but makes the point.
A typical SSD drive system would
process data at about $1.2 a gigabyte
at 30K IOPS and a SATA at about
$0.05 but only at 250 IOPS
IOPS (input/output operations per
second) .
Let’s take a simple cluster…
Copyright 2014 Denis Rothman
38. Big Data – Apache Hadoop
□ In a simple cluster, 30 000 IOPS
are delivered in parallel with around
120 nodes (around 250 IOPS) at the
same time BUT for the IOP price of
SATA.
□ We’re talking about cheaper and
more expendable equipment.
Copyright 2014 Denis Rothman
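A quick sketch of the arithmetic behind this comparison (the 2011 prices used here are the deck’s own illustrative figures, not current ones) :

# Aggregate IOPS from many cheap SATA nodes vs one SSD system
ssd_iops, ssd_cost_per_gb = 30_000, 1.2      # illustrative 2011 SSD figures
sata_iops_per_node, sata_cost_per_gb = 250, 0.05   # illustrative 2011 SATA figures

nodes_needed = ssd_iops / sata_iops_per_node
print(nodes_needed)                                 # 120.0 nodes reach the same 30 000 IOPS
print(round(ssd_cost_per_gb / sata_cost_per_gb))    # ~24x cheaper per GB on SATA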
39. Big Data – Map Reduce
□ This means that in a cluster
architecture failures will be more
frequent with cheaper equipment.
Copyright 2014 Denis Rothman
40. □ Failures with cheaper equipment ?
Who cares ? Don’t get ripped off purchasing
expensive, highly reliable hardware ; buy expendable
material to be cost-efficient.
We just need to find a way to detect failures and
respond quickly to deal with this
complexity.
We’ll need to replicate the data up to three
times in three different locations.
Let’s see how to solve these problems with
Apache Hadoop.
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
41. Hadoop is about clusters built with
commodity hardware, not high-end
hardware :
• widely available
• interchangeable
• plug and play
• breaks down more often
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
42. □ Before we go on, what’s
the purpose of all this ?
WHY ?
It all started with Google, which
had to index pages every
day and quickly reached huge
amounts of data. Hadoop
reaches back to the
Google File System (GFS)
and Google MapReduce. In
the early days, Yahoo! and
Apache got involved in the
process.
Around 2004, Google started
publishing all this…
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
43. □ Let’s take Facebook. You know all the
information that’s in there for you. But with
over 1 000 000 000 users + the 450 000 000
WhatsApp users, we’re talking about a
massive chunk of the world population
increasing the size of Facebook every day.
We’re talking about data increasing by exabytes
in this case. How are you going to run a
search over that one dataset spread over
hundreds of thousands of nodes ?
With Apache Hadoop !
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
44. Big Data – Apache Hadoop
Apache Hadoop was designed for DISTRIBUTED DATA OVER THE CLUSTERS
Apache Hadoop was designed with the concept of DATA LOCALITY
Hadoop Distributed File
System (HDFS)
Hadoop Map Reduce
Copyright 2014 Denis Rothman
45. □ HDFS has 3 main functions : split,
scatter and replicate.
Big Data – Apache Hadoop
1. SPLITTING. In Hadoop
each FILE BLOCK has the
SAME size (64 MB, for
example) in a STORAGE
BLOCK
2. SCATTERING. These FILE
BLOCKS are generally
on different datanodes
3.REPLICATION : There are
multiple copies of these
blocks in different
locations.
Copyright 2014 Denis Rothman
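A minimal, purely illustrative sketch of the split / scatter / replicate idea (this is toy Python added for the course, not the real HDFS implementation; node names and file size are invented) :

import itertools

BLOCK_SIZE = 64 * 1024 * 1024          # 64 MB, the classic HDFS block size
DATANODES = ["node1", "node2", "node3", "node4", "node5"]
REPLICATION = 3

def split(data: bytes):
    """SPLIT the file into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def place(blocks):
    """SCATTER each block and REPLICATE it on 3 different datanodes."""
    placement = {}
    ring = itertools.cycle(DATANODES)
    for idx, _ in enumerate(blocks):
        placement[idx] = [next(ring) for _ in range(REPLICATION)]
    return placement

blocks = split(b"x" * (200 * 1024 * 1024))   # a fake 200 MB file
print(len(blocks))                           # 4 blocks (3 full + 1 partial)
print(place(blocks))                         # e.g. {0: ['node1', 'node2', 'node3'], ...}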
46. Big Data – Apache Hadoop
Architecture
Learn more : http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Blocks
One main
node
Generally
3 copies in
the
replication
process so
nodes can
fail !
Copyright 2014 Denis Rothman
47. □ The NameNode is the centerpiece
of an HDFS file system. It keeps
the directory tree of all files in the
file system, and tracks where
across the cluster the file data is
kept. It does not store the data of
these files itself.
□ Client applications talk to the
NameNode whenever they wish to
locate a file, or when they want to
add/copy/move/delete a file. The
NameNode responds to
successful requests by returning a
list of relevant DataNode servers
where the data lives (addresses).
Big Data – Apache Hadoop
Learn more : http://wiki.apache.org/hadoop/NameNode
Works fine for
failures on
commodity
equipment !
Copyright 2014 Denis Rothman
48. □ So what happens when the
NameNode fails ?
□ Hadoop has copies of the data and
as long as the same IP address is
reassigned, a new NameNode will be
designated and that’s it !
Big Data – Apache Hadoop
Learn more : http://wiki.apache.org/hadoop/NameNode
Copyright 2014 Denis Rothman
49. Once the HDFS is set up, MAP REDUCE
is there to retrieve information in a
simple way.
First a MAPPER is used, then the
information is REDUCED.
Let’s see how this happens.
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
50. The MAPPER function relies on the fact
that the data is EVENLY
DISTRIBUTED. This means that
Massive Parallel Processing is
possible.
The MAPPER uses the LOCALITY (hence
« MAP ») features of HADOOP to
optimize its search.
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
51. □ If the file blocks were not of equal size,
the processing time would be driven by the
largest block.
□ But since in Hadoop the file blocks have the
same size, processing is tremendously
enhanced for MPP.
□ A little caveat could be unequal internet
connections, but most organizations
have solved this and there are replications
everywhere…
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
52. Big Data – Apache Hadoop
Suppose you need to analyse
the number of times the phrase
« Happy New Year » appears in
Google searches at midnight on
December 31st in each
timezone.
Let’s say we’re concentrating
on France only and that the
nodes containing this data are
Nodes 1, 2, 3 (at their
addresses).
Copyright 2014 Denis Rothman
53. □ Now we run a <key,value> pair with
the mapping functions. The key here
is « Happy New Year » and the value
will be the number of times it
appears.
□ In Node 1: <Happy New Year,
1000000>, Node 2 : <Happy New
Year, 4000000>, Node 3: <Happy
New Year, 2000000>
Big Data – Apache Hadoop
Copyright 2014 Denis Rothman
54. Big Data – Apache Hadoop
□ Let’s get a look and feel of Hadoop
command line functions, among
others.
□ https://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
Copyright 2014 Denis Rothman
55. Big Data – Map Reduce
□ The Node 1: <Happy New Year, 1000000>, Node 2 :
<Happy New Year, 4000000>, Node 3: <Happy New
Year, 2000000> data is sent to a reduce node to run
the REDUCE function, which will give the following
output :
<Happy New Year, 1000000,4000000,2000000> to be
summed up, for example, to <Happy New Year,
7000000>
Mapping and reducing are thus 2 simple but powerful
functions.
If various keys are sent, they are SORTED through a
shuffling process.
Copyright 2014 Denis Rothman
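A tiny in-memory sketch of the map / shuffle / reduce flow described above (illustrative Python using the per-node counts from the slide; this is not the real Hadoop API) :

from collections import defaultdict

# MAP: each node emits <key, value> pairs for its local data slice
node_outputs = [
    [("Happy New Year", 1_000_000)],   # Node 1
    [("Happy New Year", 4_000_000)],   # Node 2
    [("Happy New Year", 2_000_000)],   # Node 3
]

# SHUFFLE: group all values by key (this is where sorting by key happens in Hadoop)
shuffled = defaultdict(list)
for pairs in node_outputs:
    for key, value in pairs:
        shuffled[key].append(value)

# REDUCE: collapse each key's list of values into a single result
reduced = {key: sum(values) for key, values in shuffled.items()}
print(reduced)   # {'Happy New Year': 7000000}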
56. Big Data – Map Reduce
□ The Mapper functions and Reduce functions
are TASKS and together they form a JOB.
□ Map Reduce’s framework has a JOB
TRACKER that schedules the tasks.
□ A JOB TRACKER will reroute tasks if a node
fails, it organizes the activities.
□ Just like HDFS has a name node, Map
Reduce has a special node assigned to the
JOB TRACKER.
Copyright 2014 Denis Rothman
57. Big Data – Map Reduce
□ Now the programmer will provide
MapReduce with a list of file blocks and the
map and reduce jobs.
□ The output is a set of keys and values.
□ All of this can be done in a tremendous MPP
run.
□ By 2015, it’s estimated that 50% of all data
will be processed with Hadoop…
Copyright 2014 Denis Rothman
58. Big Data – High level software
□ Now the programmer will provide
MapReduce with a list of file blocks and the
map and reduce jobs.
□ The output is a set of keys and values.
□ All of this can be done in a tremendous MPP
run.
□ By 2015, it’s estimated that 50% of all data
will be processed with Hadoop type
technology…
Copyright 2014 Denis Rothman
59. Getting Started with Hadoop
MapReduce
Now let’s get Hadoop
MapReduce into the equation
Learn more: http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html
http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Pre-requisites
Let’s get a look and feel of MapReduce functions :
http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html#Example%3A+WordCount+v1.0
Just bear in mind that you’re looking at developing
<key,value> sets, both mapping them and reducing them.
Copyright 2014 Denis Rothman
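The WordCount v1.0 example linked above is written in Java. As an alternative look and feel, the same idea can be sketched as two small Python scripts in the Hadoop Streaming style (a hedged sketch; file names are illustrative, and the exact streaming-jar invocation depends on your distribution) :

#!/usr/bin/env python
# mapper.py - reads text lines from stdin, emits "<word> TAB 1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - input arrives sorted by key, so counts can be summed per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

These scripts would typically be handed to the hadoop-streaming jar with the -mapper and -reducer options; Hadoop takes care of splitting the input, shuffling the keys and running the reducers in parallel.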
60. MapReduce
More look-and-feel
approaches :
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Mapper.html
Copyright 2014 Denis Rothman
61. Apache Hadoop MapReduce
Architecture
□ Let’s take five here and see what
we’ve got up to here. Ok, we have
Hadoop and MapReduce.
□ Let’s see how this fits together and
how we can access data at a higher
level.
□ We’re going to take a look at how
Google explains this…
Copyright 2014 Denis Rothman
62. Apache Hadoop MapReduce
Architecture
Google explains the
concept with a
physical retrieval
analogy :
1. Standard software
query : 1 person
2. MapReduce :
several persons
Let’s work on this
physical file
system
Learn more : https://cloud.google.com/developers/articles/apache-hadoop-hive-and-pig-on-google-compute-engine#appendix-b
Copyright 2014 Denis Rothman
63. Getting Started with PIG
All the tools are there,
just use them !
You’re going to have to choose a
platform or just rent one as
explained further in the
document.
Copyright 2014 Denis Rothman
64. PIG
Let’s have some fun
with high level
programming !
« Pig is a high-level platform for
creating MapReduce programs used
with Hadoop. »
Learn more : http://en.wikipedia.org/wiki/Pig_(programming_tool)
What does a pig do ? It « grunts ».
You can use Grunt to run Pig, you can
use Pig to run Python code, you can
use Pig for the MapReduce
framework.
Just stop thinking « categories », be
creative and have fun !
Copyright 2014 Denis Rothman
65. PIG
Learn more : http://en.wikipedia.org/wiki/Pig_(programming_tool)
http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions
Let’s have a look at
some of the PIG
functions to get the feel
of it.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-pig-udf.html
Copyright 2014 Denis Rothman
66. What if I don’t want to use Pig ?
There are a lot of languages you can use that
integrate the Hadoop & MapReduce framework !
Java : http://www.javacodegeeks.com/2013/08/writing-a-hadoop-mapreduce-task-in-java.html
PHP : http://stackoverflow.com/questions/10978975/need-a-map-reduce-function-in-mongo-using-php
C++ : http://cxwangyi.blogspot.fr/2010/01/writing-hadoop-programs-using-c.html
Python : https://developers.google.com/appengine/docs/python/dataprocessing/helloworld
Copyright 2014 Denis Rothman
67. Big Data or Standard Databases ?
□ File Systems or
databases ?
□ So now what ?
SQL solutions ?
No SQL solutions ?
□ Both ?
Let’s take a few minutes and find some examples in
which one philosophy or another is best for a company.
SQL ?
No SQL ?
Copyright 2014 Denis Rothman
68. Big Data – NOSQL
Learn more : http://en.wikipedia.org/wiki/NoSQL
□ First let’s get rid of a simple and old concept : SQL
□ When you want to explore exabytes of data, SQL is
useless.
□ « The term NoSQL (Not Only SQL) was used in 1998
to name a lightweight, open source database that did
not expose the standard SQL interface. Strozzi
suggests that, as the current NoSQL movement
"departs from the relational model altogether", it
should therefore have been called more appropriately
'NoREL'. »
□ In some cases the volume of data and its nature
(documents, texts) can’t be accessed through SQL.
Copyright 2014 Denis Rothman
69. Big Data – NOSQL
□ « Some notable implementations of NoSQL
are Facebook's Cassandra database,
Google's BigTable and Amazon's SimpleDB
and Dynamo. »
□ Let’s approach NOSQL with one of its core
concepts. In an RDBMS
(relational database management system)
several users can’t modify exactly the same
record at the same time. The system is
based on read-write-relational functions.
Copyright 2014 Denis Rothman
70. Big Data – NOSQL
In an RDBMS, the last user that writes to
exactly the same record will override
previous writes. Of course you can
append a record per user but then
you have multiple records for the
same data index.
So generally you lock the record
while it’s in use or use a LIFO (Last In,
First Out) approach.
Copyright 2014 Denis Rothman
71. Big Data – NOSQL
Learn more : http://www.techopedia.com/definition/27689/nosql-database
The fundamental difference in NOSQL is
that the relations don’t matter
anymore, so unique keys don’t
matter either.
You’re not worried about read and write
rules, relations, inner joins, size
constraints, time constraints.
Copyright 2014 Denis Rothman
72. Big Data – NOSQL
Learn more : http://en.wikipedia.org/wiki/NoSQL
With NOSQL you can
scatter your data
everywhere, on
various servers at
the same time
and write multiple
records with
multiple
simultaneous
users with millions
of same type
entries !
Copyright 2014 Denis Rothman
73. Big Data – SQL, Data Warehouse
and perspective
Let’s make NOSQL concepts clear :
- Hive is a language that is SQL-related and used with Big Data
- Pig is a NoSQL language
- You can use both in a project !
http://gcn.com/blogs/reality-check/2014/01/hadoop-vs-data-warehousing.aspx
A traditional data warehouse feeds data into a relational database.
What about a Hadoop data warehouse ? Why not ?
Perspective : Stop thinking of a data flow from a client
to server, start thinking about a universe of scattered
data ! Think from the point of view of the crowd not
the individual. Stop thinking about a single solution,
just use everything you can to reach your goal !
Copyright 2014 Denis Rothman
74. MongoDB
Learn more : http://www.mongodb.org/
Whereas Apache Hadoop is based on HDFS, MongoDB is a
NOSQL document database.
- Document-Oriented Storage with JSON-style documents
-Index support
-Querying
-Map/Reduce
Copyright 2014 Denis Rothman
75. MongoDB
http://docs.mongodb.org/manual/core/map-reduce/
Let’s get the feel of Mongodb and MapReduce functions
So, again, stop thinking « I’m into relational databases and
this is a non-relational database, which one do I have to choose ? »
You don’t have to choose !
At one point Facebook, and this might still be true, gathered data in
MySQL, sent it out to Hadoop and then retrieved it with MapReduce :
mapping it, shuffling it, reducing it and making sense of it back …in
MySQL for its users !!!
Copyright 2014 Denis Rothman
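A hedged look-and-feel sketch using the pymongo driver (it assumes a local MongoDB instance and an invented « searches » collection; the JavaScript map and reduce functions follow the pattern in the MongoDB manual linked above, and note that map_reduce exists in older pymongo versions but has since been deprecated in favour of the aggregation pipeline) :

from pymongo import MongoClient
from bson.code import Code

client = MongoClient("mongodb://localhost:27017/")
db = client["demo"]

# Document-oriented storage: just insert JSON-style documents, no schema needed
db.searches.insert_one({"phrase": "Happy New Year", "node": "node1", "count": 1_000_000})

# Map/Reduce on the server: emit <phrase, count> then sum the counts per phrase
mapper = Code("function () { emit(this.phrase, this.count); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

result = db.searches.map_reduce(mapper, reducer, "phrase_totals")
for doc in result.find():
    print(doc)   # e.g. {'_id': 'Happy New Year', 'value': ...}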
76. Purchasing and managing your
« Hadoop-MapReduce-MongoDB,
PIG » Architecture
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
□ First you need to set up or choose a
type of physical Cloud architecture.
□ You need to make a financial and
technical decision.
□ If your company is not big enough to
build its own cluster, then you need
to choose among cloud offers.
Copyright 2014 Denis Rothman
77. Getting Started with Hadoop
Learn more : http://www.ovh.com/fr/serveurs_dedies/big-data/
Copyright 2014 Denis Rothman
78. Getting Started with Hadoop
□ Just a concept to bear in mind but you don’t have to do it on your
own as explained previously. Cloud services provide this.
□ "You have 10 machines connected in LAN and i need to create
Name Node in one system and Data Nodes in remaining 9
machines .
□ For example you have ( 1.. 10 ) machines , where machine1
is Server and from machine(2..9) are slaves[Data Nodes] so
do i need to install Hadoop on all 10 machines ?
□ You need Hadoop installed in every node and each node should
have the services started as for appropriate for its role. Also the
configuration files, present on each node, have to coherently
describe the topology of the cluster, including location/name/port
for various common used resources (eg. namenode). Doing this
manually, from scratch, is error prone, specially if you never did
this before and you don't know exactly what you're trying to do.
Also would be good to decide on a specific distribution of Hadoop
(HortonWorks, Cloudera, HDInsight, Intel, etc) »
Copyright 2014 Denis Rothman
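For a look and feel of what « coherently describe the topology » means in practice, here is a minimal Hadoop 1.x-style configuration sketch (the machine names come from the quote above; a real installation has many more settings and modern distributions generate these files for you) :

<!-- conf/core-site.xml on every node: where the NameNode lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://machine1:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: the replication factor discussed earlier -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

# conf/slaves: one DataNode host per line
machine2
machine3
...
machine10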
79. Getting Started with Hadoop
Do you have an Amazon account ?
What do you know about what’s beyond your account ?
Does Amazon have Big Data Technology ?
How far does Amazon go in this field ?
Let’s see…
Copyright 2014 Denis Rothman
80. Getting Started with Hadoop
Learn more : http://aws.amazon.com/big-data/
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html
Copyright 2014 Denis Rothman
81. Getting Started with your Big Data
Architecture
Let’s have a look at a real Big Data account
and resource management interface.
http://aws.amazon.com/s3/pricing/
https://console.aws.amazon.com/console/home?region=eu-west-1#
https://console.aws.amazon.com/elasticmapreduce/vnext/home?region=eu-west-1#getting-started:
Copyright 2014 Denis Rothman
82. Big Data – Ebay
□ eBay has a nice way of summing it up
before we get down to analyzing.
http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/#.UxncJbV5Gx4
Copyright 2014 Denis Rothman
83. Analyst
The analysts are here
Let’s find out what they do and what you could do in the future !
Copyright 2014 Denis Rothman
84. Big Data – Analyst
First you need to forget about
consumption (sales, marketing) and all the
clichés you hear around you.
Why ? Because the first step is to set highly
creative goals, then to map, reduce and
transform them into useful data. Useful
data can be for medical research, police
departments, astronomy and many other
areas.
Copyright 2014 Denis Rothman
85. Big Data – Analyst
At Planilog, we created a powerful Advanced
Planning System that deals with the 3 Vs
(Volume, Velocity and Variety). Our APS
can optimize any field of data.
Without going into the detail of our APS
program, the following slides are going to
provide you with tools to begin analyzing.
Of course, you can analyze anything any
way you want. This is just a guideline
we used that helped us solve hundreds of
problems.
Copyright 2014 Denis Rothman
86. Big Data – Analyst
Planilog’s first conceptual approach starts
with Cognitive Science and
Linguistics.
Human activity can be broken down into
two great categories :
passive and active.
Copyright 2014 Denis Rothman
87. Analyst
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Let’s take some passive activities using
just one or two senses. You can easily
guess the others after.
Eyes :
• Watching (movies, events, any other)
• Reading
• Listening to music
Copyright 2014 Denis Rothman
88. Analyst
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Let’s take some active activities using
some senses. You can easily guess
the others after.
- Writing documents, chats, mails
• Talking over the phone
• Combining video and sound : Skype
Copyright 2014 Denis Rothman
89. Big Data – Analyst
Now that you have an idea of active and
passive activities, let’s see what they can
apply to and what we can get out of them :
Thought process -> analyzing how someone
thinks (« Sentiment analysis »)
Feeling -> Sentiment analysis
Body -> Movement analysis (GPS, for
example).
Copyright 2014 Denis Rothman
90. Analyst
Learn more : http://en.wikipedia.org/wiki/Apache_Hadoop
Finally there are only two ways to measure passive/active
activities applied to the thinking-feeling-body process.
It just boils down to this :
qualitative properties and quantitative analysis.
Once we know what we’re analysing and how much, we
can pretty much make a model of the whole universe !
We could sum it up with brackets :
<property or key, quantity> or if we simplify :
<key, value>
Sounds familiar ?
See the power ? See why you need to analyse what you’re
going to do before you analyse the data.
Copyright 2014 Denis Rothman
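To make the <key, value> idea concrete, here is a tiny illustrative sketch (the sample data is invented) that models observed passive/active activities as key/value counts, exactly the shape a mapper would emit :

from collections import Counter

# Invented sample of observed activities (the "what" we decided to analyse)
events = ["reading", "watching", "reading", "writing", "reading", "talking"]

# <key, value>: the property is the key, the quantity is the value
pairs = Counter(events)
print(pairs)   # Counter({'reading': 3, 'watching': 1, 'writing': 1, 'talking': 1})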
91. Analyst
The Hadoop tools available don’t need to be isolated in
terms of concepts but simply must be interoperable.
You can Sqoop data from a relational database and collect
event data with Flume. You have HDFS (distributed
file system) that you can access in a non-relational
way with Pig or even use in a data warehouse with
Hive. With MapReduce you can run parallel
computation. If you need more resources, you can
use Whirr to deploy more clusters and ZooKeeper to
configure, manage and coordinate all of this !
So there is no relational, non relational opposition, there
is no « standard » approach. There is simply a goal to
attain with the best means possible.
Copyright 2014 Denis Rothman
92. Big Data – Privacy
Everything you touch is
stored, replicated,
mapped, reduced and
processed.
Just focus on the legal
aspect not on ethics.
Do you think all of this is
legal ?
What’s legal ? In which
country ? Where ? How ?
Can this be prevented ?
Copyright 2014 Denis Rothman
93. Big Data – On which side of the
Cloud are you ?
Now let’s forget about
the legal aspect.
How do you feel about
Clouds and Big Data ?
Do you feel threatened
?
Do you think it’s the
end of your freedom ?
Copyright 2014 Denis Rothman
94. Big Data – On which side of the
Cloud are you ?
Now if you feel it’s progress
with some drawbacks,
you’re ready to be a Big
Data analyst !
Do you agree with this or
not ?
Progress
Copyright 2014 Denis Rothman
95. Analyst : for those who agreed on
using Big Data ! ☺, the others can
leave ☺
Let’s sum it up before we begin analyzing real
projects and cases.
Conceptually if you use an active/passive
matrix applied to thought-feeling-physical
body, you can understand a great number
of models.
With Pig -> MapReduce -> Hadoop, and maybe
MongoDB added or not, you’re going to map,
reduce and transform DATA into useful
INFORMATION for decision-making
processes. You’re exploring time and space.
Copyright 2014 Denis Rothman
96. Analyst : Can you imagine the data
to be mapped and retrieved in
various fields ?
You need to
think
differently.
Forget
everything
you
learned and
be open to
new, very
new ideas.
Let’s hit the
road now !
Copyright 2014 Denis Rothman
97. Oh, you think this is
theory for the future ?
□ Ok,well you can stop laughing. Let’s have a
look at sentiment analysis tools :
http://blog.mashape.com/post/48757031167/list-of-20-sentiment-analysis-apis
How do you feel about that ? Remember, if you Tweet about this
page, it will be analyzed, so be careful of what you’re thinking
and writing !
https://www.mashape.com/
How is your mind shaped ?
How many applications are there out there ?
Think as a global data analyst and not an individual. Express your
thoughts.
99. Let’s carry out a little experiment
What do you think about Sentiment
analysis if you were Tweeting your
impression. Let’s analyze the
audience :
Key <positive , value>
Key <negative , value>
More difficult. Explain why.
Key <objective, value>
Key <subjective, value>
Copyright 2014 Denis Rothman
100. Big Data – Life saving
Try to find some ideas to save lives when there is a fire, or
to protect women from violence, or any other idea that
comes to your mind.
Think social networks, think drones, swarms of robots, think from
the point of view of the swarm command, like in SC2, but to help
people, not for the comfort of an individual.
Copyright 2014 Denis Rothman
101. Big Data – Life saving
https://www.cmu.edu/silicon-valley/news-events/news/2011/stamberger-interviewed.html
102. Big Data – Insurance
How can you optimize the price of the
premiums in real time worldwide with
Hadoop Mapreduce ?
Start with a major disaster and see how
you’re going to pay and forecast future
disasters.
Hadoop can be used for predictive functions.
Copyright 2014 Denis Rothman
103. Big Data – Insurance also needs
human resources.
How can you optimize
part time jobs in a
huge quantitative
environment in which
you have 100 000
employees to manage
?
http://www.optimaldecisionsllc.com/Welcome.html
104. Big Data – Amazon
Think of the passive-active matrix
and the related activities (thought,
feeling, physical) and tell me how
you would use Big Data.
How could you find a way to get
sentiment analysis out of the
reader ?
Copyright 2014 Denis Rothman
105. Big Data – Amazon-Kindle
What <key, value> pairs would you be looking for ?
Copyright 2014 Denis Rothman
106. Big Data – Twitter
Try to find a great many
positive applications for
Twitter.
We’ve seen the APIs.
Do you have ideas ?
Life Saving
Science and research
Other ?
We’re not going to talk about
the negative ones. You need
to think of how to go forward,
not slow down !
Copyright 2014 Denis Rothman
107. Big Data – Design Facebook
□ Describe the data
that Facebook
collects.
□ How can it legally
access 50% more
data than it gathered
in the first
place….?
Copyright 2014 Denis Rothman
108. Big Data – Design Facebook
□ WhatsApp ! 450 million new users !
How would you Map, Shuffle and
Reduce this data to fit into your
Facebook strategy ?
Advertising is a cliché.
What else can you do ?
Do you know what Stealth
Marketing is ?
http://en.wikipedia.org/wiki/Undercover_marketing
How can you analyze and detect it
automatically if you were a
government consumer protection
agency ? Why wouldn’t
governments map, shuffle, and
reduce illegal behaviours ?
109. Big Data – Design Sony Smartband
http://www.expansys.fr/sony-smartband-swr10-with-2-black-wristbands-sl-257855/?utm_source=google&utm_medium=shopping&utm_campaign=base&mkwid=svHjLhmZB&kword=adwords_productfeed&gclid=CI36mJSLgb0CFWjpwgod1wMANA
110. Smartwatches
□ Samsung has one too that measures
your heartbeat.
□ http://venturebeat.com/2013/09/01/this-is-samsungs-galaxy-gear-smartwatch-a-blocky-health-tracker-with-a-camera/
They want you to think about what you can do with the watch
while they’re thinking of what to do with the global data they’re
gathering as well. What could you analyse with Big Data
tools ?
Copyright 2014 Denis Rothman
113. Trucks : Tracking and Sensors
□ http://online.wsj.com/news/articles/SB10001424127887324178904578340071261396666
Sensors, robots, drones. Surveillance to optimize !
Give some of your ideas…
Copyright 2014 Denis Rothman
114. Big Data – Design Sony Smartband
Now you can pick up the pulse rate and all activity. How
can you relate this to all the other data on a group of
people and not just yourself ? Think of a concert and
sentiment analysis, for example.
http://www.sonymobile.com/us/products/accessories/smartwatch/#tabs
https://play.google.com/store/apps/details?id=com.sonyericsson.extras.smartwatch&hl=fr
Copyright 2014 Denis Rothman
115. Governments and government
agencies
What can a government collect
that corporations can’t ?
Can governments reach the level of
private corporations to protect
you ?
How can it be done ?
With what budget ?
116. Governments and government
agencies
Phone companies gather data.
Google, Microsoft and others gather
data
In fact everybody gathers data !
So could you gather as much data as
the government, in the end ?
Why or why not ?
117. Big Data – Can you trust yourself
to drive your car ?
□ Google, like all others, has you focus
on your individual needs.
□ In the meantime, your personal data
has gone global.
□ Think global and tell me how you
would analyze the data.
In this case we’re dealing with Big Data streaming data like in
online gaming. So when you’re parsing the data, NoSQL or SQL
is not the issue, getting the right information straight is the vital
goal !
Copyright 2014 Denis Rothman
118. Big Data – Can you trust a human
to drive your car ?
A Google Car gathers on average about a
gigabyte per second, which could add up to
over 80 TB a day. And you ?
□ http://www.isn.ethz.ch/Digital-Library/Articles/Detail/?lng=en&id=173004
Google explains how it collects all types of data.
http://googlepolicyeurope.blogspot.fr/2010/04/data-collected-by-google-cars.html
Where is all the data going : Big Data ?
http://www.hostdime.com/blog/google-self-driving-car-news/
119. Now take a step back and imagine all
of the data gathered and accessed by
a single group of analysts…
…And now go out, imagine and conquer the world of Big Data !
120. Big Data – A New Data Paradigm :
no limits
You can ask your questions now or
contact me at
Denis.Rothman76@gmail.com
Copyright 2014 Denis Rothman