This document provides an overview of big data, including its definition, characteristics, categories, sources, storage, analytics, and the challenges and opportunities it presents. Big data refers to large and complex datasets that are difficult to process using traditional database management tools. It is characterized by the 5 V's: volume, variety, velocity, value, and veracity. Big data comes from both internal and external sources and can be structured, unstructured, or semi-structured. It requires specialized storage technologies like Hadoop and NoSQL databases. Big data analytics uses techniques like machine learning, regression analysis, and social network analysis to gain insights. The growth of big data presents both challenges in processing diverse and voluminous data and opportunities to generate value.
The presentation covers the following topics: 1) Hadoop Introduction 2) Hadoop nodes and daemons 3) Architecture 4) Hadoop best features 5) Hadoop characteristics. For further knowledge of Hadoop, refer to the link: http://data-flair.training/blogs/hadoop-tutorial-for-beginners/
The document describes a 10-module data science course covering topics such as introduction to data science, machine learning techniques using R, Hadoop architecture, and Mahout algorithms. The course includes live online classes, recorded lectures, quizzes, projects, and a certificate. Each module covers specific data science topics and techniques. The document provides details on the course content, objectives, and topics covered in module 1, which includes an introduction to data science, its components, use cases, and how to integrate R and Hadoop. Examples of data science applications in various domains like healthcare, retail, and social media are also presented.
Introduction
Big Data may well be the Next Big Thing in the IT world.
Big data burst upon the scene in the first decade of the 21st century.
The first organizations to embrace it were online and startup firms. Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning.
Like many new information technologies, big data can bring about dramatic cost reductions, substantial improvements in the time required to perform a computing task, or new product and service offerings.
This session focuses on Business Intelligence Best Practices with an emphasis on dashboard design and performance techniques. Learn about the different types of users and consumers of BI and how they impact your development strategy.
Big Data: Its Characteristics And Architecture Capabilities (Ashraf Uddin)
This document discusses big data, including its definition, characteristics, and architecture capabilities. It defines big data as large datasets that are challenging to store, search, share, visualize, and analyze due to their scale, diversity and complexity. The key characteristics of big data are described as volume, velocity and variety. The document then outlines the architecture capabilities needed for big data, including storage and management, database, processing, data integration and statistical analysis capabilities. Hadoop and MapReduce are presented as core technologies for storage, processing and analyzing large datasets in parallel across clusters of computers.
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado... (Edureka!)
This Edureka Big Data tutorial helps you understand Big Data in detail. It discusses the evolution of Big Data, the factors associated with Big Data, and the different opportunities in Big Data. It then covers the problems associated with Big Data and how Hadoop emerged as a solution. Below are the topics covered in this tutorial:
1) Evolution of Data
2) What is Big Data?
3) Big Data as an Opportunity
4) Problems in Encasing Big Data Opportunity
5) Hadoop as a Solution
6) Hadoop Ecosystem
7) Edureka Big Data & Hadoop Training
This document provides an overview of big data concepts including definitions of big data, characteristics of big data using the 5Vs model, common big data technologies like Hadoop and MapReduce, and use cases. It discusses how big data has evolved over time through increased data volumes and varieties. Key frameworks like HDFS and MapReduce that enable distributed storage and processing of large datasets are explained. Examples of big data applications in areas such as banking are also provided.
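Since HDFS and MapReduce come up repeatedly in these summaries, a concrete example helps. Below is a minimal sketch of the classic MapReduce word count written as a pair of Hadoop Streaming scripts in Python; the file names (mapper.py, reducer.py) and sample data are illustrative, not taken from any of the summarized decks.

#!/usr/bin/env python3
# mapper.py - emit (word, 1) for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - sum the counts per word; Hadoop Streaming delivers the
# mapper output grouped and sorted by key, so a running total suffices.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The same pair can be tested locally without a cluster, for example with: cat input.txt | python3 mapper.py | sort | python3 reducer.py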
This document provides a syllabus for a course on big data. The course introduces students to big data concepts like characteristics of data, structured and unstructured data sources, and big data platforms and tools. Students will learn data analysis using R software, big data technologies like Hadoop and MapReduce, mining techniques for frequent patterns and clustering, and analytical frameworks and visualization tools. The goal is for students to be able to identify domains suitable for big data analytics, perform data analysis in R, use Hadoop and MapReduce, apply big data to problems, and suggest ways to use big data to increase business outcomes.
Getting real-time analytics for device/application/business monitoring from trillions of events and petabytes of data, as companies like Netflix, Uber, Alibaba, PayPal, eBay, and Metamarkets do.
This document discusses various applications of big data across different domains. It begins by defining big data and its key characteristics of volume, variety and velocity. It then discusses how big data is being used in social media for recommendation systems, marketing, electioneering and influence analysis. Applications in healthcare discussed include personalized medicine, clinical trials, electronic health records, and genomics. Uses of big data in smart cities are also summarized, such as for smart transport, traffic management, smart energy, and smart governance. Specific examples and case studies are provided to illustrate the benefits and savings achieved from leveraging big data across these various sectors.
Raffael Marty gave a presentation on big data visualization. He discussed using visualization to discover patterns in large datasets and presenting security information on dashboards. Effective dashboards provide context, highlight important comparisons and metrics, and use aesthetically pleasing designs. Integration with security information management systems requires parsing and formatting data and providing interfaces for querying and analysis. Marty is working on tools for big data analytics, custom visualization workflows, and hunting for anomalies. He invited attendees to join an online community for discussing security visualization.
This document discusses cloud computing, big data, Hadoop, and data analytics. It begins with an introduction to cloud computing, explaining its benefits like scalability, reliability, and low costs. It then covers big data concepts like the 3 Vs (volume, variety, velocity), Hadoop for processing large datasets, and MapReduce as a programming model. The document also discusses data analytics, describing different types like descriptive, diagnostic, predictive, and prescriptive analytics. It emphasizes that insights from analyzing big data are more valuable than raw data. Finally, it concludes that cloud computing can enhance business efficiency by enabling flexible access to computing resources for tasks like big data analytics.
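To ground the distinction between those analytics types, here is a minimal sketch of the descriptive end of the spectrum using pandas; the sales figures and column names are invented for illustration.

# Descriptive analytics sketch: summarize what happened (toy data).
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "revenue": [100.0, 80.0, 120.0, 90.0],
})

print(sales["revenue"].describe())                 # count, mean, std, min, max, ...
print(sales.groupby("region")["revenue"].mean())   # average revenue per region

Diagnostic, predictive, and prescriptive analytics build on this kind of summary with correlation analysis, trained models, and optimization, respectively.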
The document outlines a data science roadmap that covers fundamental concepts, statistics, programming, machine learning, text mining, data visualization, big data, data ingestion, data munging, and tools. It provides the percentage of time that should be spent on each topic, and lists specific techniques in each area, such as linear regression, decision trees, and MapReduce in big data.
What is Big Data?
Big Data Laws
Why Big Data?
Industries using Big Data
Current process/SW in SCM
Challenges in SCM industry
How can Big Data solve these problems?
Migration to Big data for an SCM industry
Disclaimer:
The images, company, product, and service names used in this presentation are for illustration purposes only. All trademarks and registered trademarks are the property of their respective owners.
Data and images were collected from various sources on the Internet.
The intention was to present the big picture of Big Data & Hadoop.
Monetizing Big Data at Telecom Service Providers (DataWorks Summit)
Hadoop enables telecom companies to gain valuable insights from large amounts of customer data. It provides a cost-effective way to store and analyze call detail records, network traffic data, customer account information, and other big data sources. This allows telecoms to improve network maintenance, enhance the customer experience, optimize marketing campaigns, and reduce customer churn. The document discusses several use cases where telecom companies have used Hadoop to save millions of dollars annually or increase revenue through better data-driven decisions.
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be... (Simplilearn)
This presentation about Hadoop will help you learn the basics of Hadoop and its components. First, you will see what Big Data is and the significant challenges it poses. Then, you will understand how Hadoop solved those challenges. You will get a glance at the history of Hadoop, what Hadoop is, the different companies using Hadoop, the applications of Hadoop in different companies, etc. Finally, you will learn the three essential components of Hadoop – HDFS, MapReduce, and YARN, along with their architecture. Now, let us get started with Introduction to Hadoop.
Below topics are explained in this Hadoop presentation:
1. Big Data and its challenges
2. Hadoop as a solution
3. History of Hadoop
4. What is Hadoop
5. Applications of Hadoop
6. Components of Hadoop
7. Hadoop Distributed File System
8. Hadoop MapReduce
9. Hadoop YARN
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying DataFrames (a short PySpark sketch follows below)
Learn more at https://www.simplilearn.com/big-data-and-analytics/introduction-to-big-data-and-hadoop-certification-training.
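As a taste of the Spark SQL objective above, here is a minimal PySpark sketch that creates, transforms, and queries a DataFrame; the data and column names are invented, and a local Spark installation is assumed.

# Create, transform, and query a DataFrame with PySpark (illustrative data).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

df = spark.createDataFrame(
    [("alice", "retail", 120.0), ("bob", "telecom", 80.0), ("carol", "retail", 200.0)],
    ["user", "sector", "spend"],
)

# Transform with the DataFrame API: filter, group, aggregate.
by_sector = (df.filter(F.col("spend") > 50)
               .groupBy("sector")
               .agg(F.avg("spend").alias("avg_spend")))
by_sector.show()

# Query the same data through Spark SQL.
df.createOrReplaceTempView("spend")
spark.sql("SELECT sector, AVG(spend) AS avg_spend FROM spend GROUP BY sector").show()

spark.stop()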
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners... (Simplilearn)
This presentation about Big Data will help you understand how Big Data evolved over the years, what Big Data is, applications of Big Data, a case study on Big Data, 3 important challenges of Big Data, and how Hadoop solved those challenges. The case study talks about the Google File System (GFS), where you’ll learn how Google solved its problem of storing growing user data in the early 2000s. We’ll also look at the history of Hadoop, its ecosystem, and a brief introduction to HDFS, a distributed file system designed to store large volumes of data, and MapReduce, which allows parallel processing of data. In the end, we’ll run through some basic HDFS commands and see how to perform wordcount using MapReduce. Now, let us get started and understand Big Data in detail.
Below topics are explained in this Big Data presentation for beginners:
1. Evolution of Big Data
2. Why Big Data?
3. What is Big Data?
4. Challenges of Big Data
5. Hadoop as a solution
6. MapReduce algorithm
7. Demo on HDFS and MapReduce
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
This document provides an overview of big data analytics, including:
1) It defines big data and describes its key characteristics of variety, velocity, and volume.
2) It outlines common types of big data like structured, unstructured, and semi-structured data.
3) It lists sources of big data such as social media, the cloud, the web, databases, and the Internet of Things.
4) It discusses challenges of big data like rapid growth, storage, security, and integrating diverse data sources.
This document provides an introduction and overview of data science. It discusses Ravishankar Rajagopalan's educational and professional background working in data science. It then covers various topics related to data science including common applications, required skills, the typical project lifecycle, team aspects, career progression, interviews, and resources for learning. Examples of unusual real-world applications are also summarized, such as using machine learning to optimize inventory levels for an oil and gas company and implementing speech recognition to predict customer intent for a call center.
This document defines big data and discusses techniques for integrating large and complex datasets. It describes big data as collections that are too large for traditional database tools to handle. It outlines the "3Vs" of big data: volume, velocity, and variety. It also discusses challenges like heterogeneous structures, dynamic and continuous changes to data sources. The document summarizes techniques for big data integration including schema mapping, record linkage, data fusion, MapReduce, and adaptive blocking that help address these challenges at scale.
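To make blocking and record linkage concrete, here is a minimal pure-Python sketch; the records and the matching rule are invented for illustration, and real systems replace the toy rule with string-distance measures or trained classifiers.

# Record linkage with blocking (toy example).
from collections import defaultdict

a = [{"id": 1, "name": "Jon Smith", "zip": "10001"},
     {"id": 2, "name": "Ann Lee", "zip": "94105"}]
b = [{"id": 7, "name": "John Smith", "zip": "10001"},
     {"id": 8, "name": "Ann Leigh", "zip": "30301"}]

# Blocking: only compare records that share a cheap key (here, the zip code),
# avoiding the full cross-product of pairwise comparisons.
blocks = defaultdict(lambda: ([], []))
for r in a:
    blocks[r["zip"]][0].append(r)
for r in b:
    blocks[r["zip"]][1].append(r)

def similar(x, y):
    # Toy matching rule: identical surname token.
    return x["name"].split()[-1].lower() == y["name"].split()[-1].lower()

matches = [(x["id"], y["id"])
           for left, right in blocks.values()
           for x in left for y in right if similar(x, y)]
print(matches)  # [(1, 7)]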
Big data is a huge volume of heterogeneous data, often generated at high speed, that cannot be handled with traditional data analytics tools. Hadoop is one of the most widely used big data analytics tools; MapReduce, Hive, and HBase are also tools for analysis in big data.
This document contains a laboratory manual for the Big Data Analytics laboratory course. It outlines 5 experiments:
1. Downloading and installing Hadoop, understanding different Hadoop modes, startup scripts, and configuration files.
2. Implementing file management tasks in Hadoop such as adding/deleting files and directories.
3. Developing a MapReduce program to implement matrix multiplication (a small simulation of this scheme appears after this list).
4. Running a basic WordCount MapReduce program.
5. Installing Hive and HBase and practicing examples.
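For experiment 3, the key/value scheme of MapReduce matrix multiplication can be simulated in a single Python process; the matrices below are illustrative, and on a real cluster the map and reduce steps would run as Hadoop tasks, but the logic is the same.

# Single-process simulation of MapReduce matrix multiplication (C = A x B).
from collections import defaultdict

A = [[1, 2], [3, 4]]   # m x n
B = [[5, 6], [7, 8]]   # n x p
m, n, p = 2, 2, 2

# Map phase: A[i][j] contributes to every output cell (i, k);
# B[j][k] contributes to every output cell (i, k).
intermediate = defaultdict(list)
for i in range(m):
    for j in range(n):
        for k in range(p):
            intermediate[(i, k)].append(("A", j, A[i][j]))
for j in range(n):
    for k in range(p):
        for i in range(m):
            intermediate[(i, k)].append(("B", j, B[j][k]))

# Reduce phase: join the A and B entries on the shared index j, sum products.
C = [[0] * p for _ in range(m)]
for (i, k), values in intermediate.items():
    a_row = {j: v for tag, j, v in values if tag == "A"}
    b_col = {j: v for tag, j, v in values if tag == "B"}
    C[i][k] = sum(a_row[j] * b_col[j] for j in a_row)
print(C)  # [[19, 22], [43, 50]]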
Big Data Analytics | What Is Big Data Analytics? | Big Data Analytics For Beg... (Simplilearn)
The presentation about Big Data Analytics will help you know why Big Data analytics is required, what Big Data analytics is, the lifecycle of Big Data analytics, types of Big Data analytics, tools used in Big Data analytics, and a few Big Data application domains. Also, we'll see a use case on how Spotify uses Big Data analytics. Big Data analytics is the process of extracting meaningful insights from Big Data, such as hidden patterns, unknown correlations, market trends, and customer preferences. One of its essential uses is product development and innovation. Now, let us get started and understand Big Data Analytics in detail.
Below are explained in this Big Data analytics tutorial:
1. Why Big Data analytics?
2. What is Big Data analytics?
3. Lifecycle of Big Data analytics
4. Types of Big Data analytics
5. Tools used in Big Data analytics
6. Big Data application domains
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying DataFrames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
The document provides an overview of key concepts in data science including data types, the data value chain, and big data. It defines data science as extracting insights from large, diverse datasets using tools like machine learning. The data value chain involves acquiring, processing, analyzing and using data. Big data is characterized by its volume, velocity and variety. Common techniques for big data analytics include data mining, machine learning and visualization.
The document discusses the syllabus for a course on Big Data Analytics. The syllabus covers four units: (1) an introduction to big data concepts like distributed file systems, Hadoop, and MapReduce; (2) Hadoop architecture including HDFS, MapReduce, and YARN; (3) Hadoop ecosystem components like Hive, Pig, HBase, and Spark; and (4) new features of Hadoop 2.0 like high availability for NameNode and HDFS federation. The course aims to provide students with foundational knowledge of big data technologies and tools for processing and analyzing large datasets.
The document discusses key concepts related to big data including what data and big data are, the three Vs of big data (volume, velocity, and variety), sources and types of big data, how big data differs from traditional databases, applications of big data across various fields such as healthcare and social media, tools for working with big data like Hadoop and MongoDB, and challenges and solutions related to big data.
Big data analytics (BDA) involves examining large, diverse datasets to uncover hidden patterns, correlations, trends, and insights. BDA helps organizations gain a competitive advantage by extracting insights from data to make faster, more informed decisions. It supports a 360-degree view of customers by analyzing both structured and unstructured data sources like clickstream data. Businesses can leverage techniques like machine learning, predictive analytics, and natural language processing on existing and new data sources. BDA requires close collaboration between IT, business users, and data scientists to process and analyze large datasets beyond typical storage and processing capabilities.
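As a minimal illustration of the predictive-analytics step mentioned above, the scikit-learn sketch below fits a classifier on toy churn data; the features, labels, and churn framing are invented, and real pipelines train on far larger datasets.

# Predictive analytics sketch: logistic regression on toy churn data.
from sklearn.linear_model import LogisticRegression

# Each row: [monthly_spend, support_calls]; label 1 = customer churned.
X = [[20.0, 0], [25.0, 1], [90.0, 5], [85.0, 4], [30.0, 1], [95.0, 6]]
y = [0, 0, 1, 1, 0, 1]

model = LogisticRegression().fit(X, y)
print(model.predict([[88.0, 5]]))        # a high-spend, high-complaint customer
print(model.predict_proba([[22.0, 0]]))  # class probabilities for a low-risk row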
Real World Application of Big Data In Data Mining Tools (ijsrd.com)
The main aim of this paper is to study the notion of Big Data and its application in data mining tools like R, Weka, RapidMiner, KNIME, and Mahout. We are awash in a flood of data today. In a broad range of application areas, data is being collected at an unmatched scale. Decisions that previously were based on surmise, or on painstakingly constructed models of reality, can now be made based on the data itself. Such Big Data analysis now drives nearly every aspect of our modern society, including mobile services, retail, manufacturing, financial services, life sciences, and physical sciences. The paper mainly focuses on the different types of data mining tools and their usage with big data in knowledge discovery.
This document provides an introduction to big data, including defining big data, discussing its history, importance, types, characteristics, how it works, challenges, technologies, and architecture. Big data is defined as extremely large and complex datasets that cannot be processed using traditional tools. It has existed for thousands of years but grew substantially in the 20th century. Companies use big data to improve operations and increase profits. The types include structured, semi-structured, and unstructured data. Big data works through data collection, storage, processing, analysis, and visualization. The challenges include rapid data growth, storage needs, unreliable data, and security issues. Technologies include those for operations and analytics. The architecture includes ingestion, batch processing, and analytical storage.
Big data refers to massive, complex datasets, including huge quantities of data from sources like social media. Big data analytics examines large amounts of heterogeneous digital data to glean insights. Big data has five characteristics: volume, variety, velocity, value, and veracity. The types of big data are structured, unstructured, and semi-structured. Data repositories like data warehouses and data lakes store organizational data to facilitate decision-making and analytics.
Characterizing and Processing of Big Data Using Data Mining Techniques (IJTET Journal)
The document discusses big data and techniques for processing it, including data mining. It begins by defining big data and its key characteristics of volume, variety, and velocity. It then discusses various data mining techniques that can be used to process big data, including clustering, classification, and prediction. It introduces the HACE theorem for characterizing big data based on its huge size, heterogeneous and diverse sources, decentralized control, and complex relationships within the data. The document proposes a big data processing model involving data set aggregation, pre-processing, connectivity-based clustering, and subset selection to efficiently retrieve relevant data. It evaluates the performance of subset selection versus deterministic search methods.
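The proposed pipeline above uses connectivity-based clustering; as a simpler stand-in that shows what the clustering stage does, here is a k-means sketch with scikit-learn (the points are invented).

# Clustering sketch with k-means (a stand-in for connectivity-based clustering).
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.8, 8.2]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster id per point, e.g. [0 0 1 1]
print(km.cluster_centers_)  # one centroid per cluster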
This document provides an overview of the key concepts in the syllabus for a course on data science and big data. It covers 5 units: 1) an introduction to data science and big data, 2) descriptive analytics using statistics, 3) predictive modeling and machine learning, 4) data analytical frameworks, and 5) data science using Python. Key topics include data types, analytics classifications, statistical analysis techniques, predictive models, Hadoop, NoSQL databases, and Python packages for data science. The goal is to equip students with the skills to work with large and diverse datasets using various data science tools and techniques.
This document provides an overview of key concepts related to data and big data. It defines data, digital data, and the different types of digital data including unstructured, semi-structured, and structured data. Big data is introduced as the collection of large and complex data sets that are difficult to process using traditional tools. The importance of big data is discussed along with common sources of data and characteristics. Popular tools and technologies for storing, analyzing, and visualizing big data are also outlined.
Big data refers to huge amounts of data from various sources that traditional data management systems cannot handle. It is characterized by volume, velocity, variety, and veracity. Handling big data requires expertise in security, management, and analytics. Data scientists use descriptive, diagnostic, predictive, and prescriptive analytics techniques on big data to create business insights and decisions using business intelligence tools. While big data offers opportunities, it also poses risks like bad data, security issues, and costs if not properly analyzed and managed.
This document discusses big data, defining it as large volumes of structured, semi-structured, and unstructured data that can be mined for information. It outlines four key characteristics of big data: volume, variety, velocity, and variability. It also discusses big data applications across various industries and provides examples of real-time big data applications. Finally, it covers challenges of conventional data systems and risks associated with big data projects.
Introduction to Big Data: Definition, Characteristic Features, Big Data Applications, Big Data vs Traditional Data, Risks of Big Data, Structure of Big Data, Challenges of Conventional Systems, Web Data, Evolution of Analytic Scalability, Evolution of Analytic Processes, Tools and methods, Analysis vs Reporting, Modern Data Analytic Tools
This document provides an overview of data science and key concepts related to emerging technologies. It describes what data science is and its role, differentiates between data and information, describes the data processing life cycle and common data types. It also discusses the basics of big data, including characteristics like volume, velocity and variety. Finally, it introduces clustered computing and components of the Hadoop ecosystem.
The document discusses how utilities are increasingly collecting and generating large amounts of data from smart meters and other sensors. It notes that utilities must learn to leverage this "big data" by acquiring, organizing, and analyzing different types of structured and unstructured data from various sources in order to make more informed operational and business decisions. Effective use of big data can help utilities optimize operations, improve customer experience, and increase business performance. However, most utilities currently underutilize data analytics capabilities and face challenges in integrating diverse data sources and systems. The document advocates for a well-designed data management platform that can consolidate utility data to facilitate deeper analysis and more valuable insights.
2. Introduction
DEFINITION
Big data is defined as the collection of large and complex datasets that are difficult to process using traditional database system tools or data processing application software.
Evolution of data scale: Mainframe (kilobytes) -> Client/Server (megabytes) -> The Internet (gigabytes) -> Big data: mobile, social media, etc. (zettabytes)
3. Characteristics of Big data
The characteristics of big data are specified with the 5 V's:
1. Volume - the vast amount of data generated every second [kilobytes -> mega -> giga -> tera -> peta -> exa -> zetta -> yotta].
2. Variety - the different kinds of data generated from different sources.
3. Velocity - the speed at which data is generated, processed, and moved around.
4. Value - the ability to draw the correct meaning out of the available data.
5. Veracity - the uncertainty and inconsistencies in the data.
4. Categories of Big data
Big data is categorized into three forms (a short illustration follows this list).
1. Structured - data that can be stored and processed in a predefined format. Ex: tables, RDBMS data.
2. Unstructured - any data without a known structure or form. Ex: output returned by a Google search, audio, video, images.
3. Semi-structured - data that contains both forms. Ex: JSON, CSV, XML, email.
Data types* -> emails, text messages, photos, videos, logs, documents, transactions, click trails, public records, etc.
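As a minimal illustration of the structured/semi-structured distinction, the hedged Python sketch below (the record fields are invented for illustration) parses a semi-structured JSON document and flattens it into a fixed, table-like row:

import json

# A semi-structured record: fields may nest or be missing entirely.
raw = '{"id": 1, "name": "Asha", "contact": {"email": "asha@example.com"}}'

record = json.loads(raw)  # parse JSON text into a Python dict

# Flatten into a structured, fixed-schema row (as an RDBMS table would store it);
# missing fields become None rather than breaking the schema.
row = (
    record.get("id"),
    record.get("name"),
    record.get("contact", {}).get("email"),
)
print(row)  # (1, 'Asha', 'asha@example.com')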
5. Examples of big data
Some examples of big data:
1. Social media: 500+ terabytes of data are generated on Facebook every day, 100,000+ tweets are created every 60 seconds, and 300 hours of video are uploaded to YouTube per minute.
2. Airlines: a single jet engine produces 10+ terabytes of data in 30 minutes of flight time.
6. Cont..,
3. Stock exchange - the New York Stock Exchange generates about one terabyte of new trade data every day.
4. Mobile phones - every 60 seconds, 698,445+ Google searches, 11,000,000+ instant messages, and 168,000,000+ emails are generated by users.
5. Walmart handles more than 1 million customer transactions every hour.
7. Sources of big data
1. Activity data - basic activities such as searches are stored by web browsers, phone usage is logged by mobile operators, credit card companies record where customers buy, and shops record what they buy.
2. Conversational data - conversations in emails and on social media sites such as Facebook, Twitter, and so on.
8. Cont.,
3. Photo and video data - pictures and videos taken with mobile phones, digital cameras, and CCTV are uploaded heavily to YouTube and social media sites every second.
4. Sensor data - the sensors embedded in all kinds of devices produce huge amounts of data. Ex: GPS provides the direction and speed of a vehicle.
5. IoT data - smart TVs, smart watches, smart fridges, etc. Ex: traffic sensors sending data to the alarm clock on a smart watch.
9. Typical Classification
I. Internal data - supports daily business operations, such as organizational or enterprise data (structured). Ex: customer data, sales data, ERP, CRM, etc.
II. External data - analyzed to understand competitors, the market environment, and technology, such as social data (unstructured). Ex: the Internet, government, business partners, syndicated data suppliers, etc.
10. Big data storage
Big data storage is concerned with storing and managing data in a scalable way, satisfying the needs of applications that require access to the data.
Some of the big data storage technologies are:
1. Distributed file system - stores large amounts of unstructured data in a reliable way on commodity hardware.
11. Cont.,
The Hadoop Distributed File System (HDFS) is an integral part of the Hadoop framework; it is designed for large data files and is well suited to quickly ingesting data and bulk processing.
2. NoSQL database - a database that stores and retrieves data modeled in means other than tabular relations, and it typically lacks ACID transactions.
Supports both structured and unstructured data.
12. The data structures used are key-value, wide-column, graph, or document.
Less functionality, more performance.
It focuses on scalability, performance, and high availability (a toy sketch follows this slide).
Evolution of storage: Flat files (no standard implementation) -> RDBMS (could not handle big data) -> NoSQL
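As a toy illustration of the key-value model (not any particular NoSQL product's API; the names and data are invented), the sketch below stores schema-free documents under keys and trades relational joins for direct lookups:

# A minimal in-memory key-value/document store sketch.
# Real NoSQL systems add sharding, replication, and persistence on top.
store = {}

def put(key, document):
    """Store a schema-free document under a key."""
    store[key] = document

def get(key):
    """Retrieve a document by key; O(1) lookup, no joins."""
    return store.get(key)

put("user:42", {"name": "Ravi", "orders": [101, 102]})  # denormalized: orders embedded
print(get("user:42"))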
13. 3. NewSQL database - provides the same scalable performance as NoSQL systems for Online Transaction Processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.
4. Cloud storage - a service model in which data is maintained, managed, and backed up remotely and made available to users over the Internet.
14. Cont.,
Eliminates the acquisition and management costs of buying and maintaining your own storage infrastructure, increases agility, provides global scale, and delivers "anywhere, anytime" access to data.
Users generally pay for their cloud data storage per consumption ("pay as you use").
15. Data intelligence
Data intelligence - the analysis of various forms of data in such a way that it can be used by companies to expand their services or investments.
It means transforming data into information, information into knowledge, and knowledge into value.
16. Data integration and serialization
Data integration - combining data residing in different sources and providing users with a unified view of it.
Data serialization - converting structured data into a format that allows it to be shared or stored in such a way that its original structure can be recovered (a round-trip sketch follows).
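A minimal round-trip sketch of serialization using Python's standard json module (JSON is just one of several formats the definition above covers; the record is invented):

import json

original = {"sensor": "gps-7", "speed_kmh": 62.5, "route": [12, 14, 9]}

# Serialize: structured data -> a string that can be stored or shared.
wire_format = json.dumps(original)

# Deserialize: the original structure is recovered intact.
restored = json.loads(wire_format)
assert restored == original
print(wire_format)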
17. Data monitoring
Data monitoring - allows an organization to proactively maintain a high, consistent standard of data quality.
• By checking data routinely as it is stored within applications, organizations can avoid the resource-intensive pre-processing of data before it is moved.
• With data monitoring, data quality is checked at creation time rather than before a move.
18. Data indexing
Data indexing - an index is a data structure added to a file to provide faster access to the data (a minimal sketch follows).
• It reduces the number of blocks that the DBMS has to check.
• It contains a search key and a pointer. Search key - an attribute or set of attributes used to look up the records in a file.
• Pointer - contains the address where the data is stored in memory.
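To make the search-key/pointer idea concrete, here is a small hedged Python sketch (the record layout is invented for illustration) that builds an index mapping a search key to a record's byte offset, so lookups skip a full scan:

# Records stored as fixed-format lines; the index maps search key -> byte offset.
records = ["1001,Asha,Chennai\n", "1002,Ravi,Delhi\n", "1003,Mina,Pune\n"]

index = {}          # search key (customer id) -> pointer (byte offset)
offset = 0
for line in records:
    key = line.split(",")[0]
    index[key] = offset          # the "pointer" to where the record starts
    offset += len(line)

data = "".join(records)

# Lookup via the index: jump straight to the record instead of scanning every block.
ptr = index["1002"]
print(data[ptr:data.index("\n", ptr)])   # -> 1002,Ravi,Delhi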
19. Why Big data?
These are the factors that led to the emergence of big data:
1. Increase in storage capacity
2. Increase in processing power
3. Availability of data
4. Deriving insights and driving growth
5. Staying competitive
20. Benefits of Big Data Processing
1. Businesses gain intelligence for decision making.
2. Better customer service.
3. Early identification of risk in products/services.
4. Improved operational efficiency - e.g., product recommendation.
5. Detecting fraudulent behavior.
21. Applications of Bigdata
Smarter health care - leveraging the health care system for easy access and efficient outcomes.
Multi-channel sales and web display advertisement
Finance
Intelligent traffic management
Manufacturing
Fraud and risk detection
Telecom
22. Analysis Vs Analytics
Analysis - the process of breaking a complex topic or substance into smaller parts in order to gain a better understanding of it (what happened in the past?). It is the process of examining, transforming, and arranging raw data in a specific way to generate useful information from it.
Analytics - a subcomponent of analysis that involves the use of tools and techniques to find novel, valuable, and exploitable patterns (what will happen in the future?).
23. Big data analytics
It is the process of collecting, storing, organizing, and analyzing large sets of heterogeneous data to gain insights and discover patterns, correlations, and other useful information.
Faster and better decision making
Enhanced performance, service, or product
Cost-effective and next-generation products
25. Stages in Big data analytics
I. Identifying the problem
II. Designing data requirements
III. Preprocessing data
IV. Visualizing data
V. Performing analytics over data
26. Traditional Vs Big data analytics
Traditional analytics: works on well-known, smaller-sized data; built on relational data models.
Big data analytics: works on data in formats that are not well understood, largely semi-structured or unstructured; retrieved from various sources, with an almost flat structure and no relationships in nature.
27. Four types of analytics
1. Descriptive Analytics: What happened?
It is backward looking and reveals what has occurred in the past with the present data (hindsight).
Two types: 1) measures of central tendency (mean, mode, and median); 2) measures of dispersion (range, variance, and standard deviation). A short sketch computing both follows.
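A minimal Python sketch of descriptive analytics using the standard-library statistics module, computing both families of measures over a toy sample (the numbers are invented):

import statistics

daily_sales = [120, 135, 120, 150, 142, 120, 160]   # toy data

# Measures of central tendency
print("mean:  ", statistics.mean(daily_sales))
print("median:", statistics.median(daily_sales))
print("mode:  ", statistics.mode(daily_sales))

# Measures of dispersion (population variants)
print("range:   ", max(daily_sales) - min(daily_sales))
print("variance:", statistics.pvariance(daily_sales))
print("std dev: ", statistics.pstdev(daily_sales))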
28. 2. Diagnostic Analytics: Why did this happen? What went wrong?
3. Predictive Analytics: What is likely to happen?
It predicts what could happen in the future (insight).
Several models are used: i) forecasting, ii) simulation, iii) regression, iv) classification, and v) clustering.
29. 4. Prescriptive analytics - What should we do to make it happen?
It suggests conclusions or actions that can be taken based on the analysis (a sketch follows).
Techniques used are i) linear programming, ii) integer programming, iii) mixed-integer programming, and iv) nonlinear programming.
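As a hedged illustration of the linear-programming technique named above (assuming SciPy is installed; the product mix and all numbers are invented), the sketch below chooses production quantities to maximize profit under resource limits:

from scipy.optimize import linprog

# Maximize profit 3x + 5y subject to:
#   2x + 4y <= 40   (machine hours)
#   3x + 2y <= 30   (labor hours)
#   x, y >= 0
# linprog minimizes, so we negate the objective.
result = linprog(
    c=[-3, -5],
    A_ub=[[2, 4], [3, 2]],
    b_ub=[40, 30],
    bounds=[(0, None), (0, None)],
)
print("produce:", result.x, "max profit:", -result.fun)  # optimum at x=5, y=7.5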
30. Approach in analytics development
Identify the data source
Select the right tools and technology to collect, store, and organize data
Understand the domain and process the data
Build a mathematical model for your analytics
Visualize and validate the result
Learn, adapt, and rebuild your analytical model.
31. Big data analytics domain
Web and e-tailing
Government
Retail
Telecommunications
Health care
Finance and banking
32. Big data techniques
There are seven widely used big data analysis
techniques. They are
1. Association rule learning
2. Classification tree analysis
3. Genetic algorithms
4. Machine learning
5. Regression analysis
6. Sentiment analysis
7. Social network analysis
33. Association rule learning
A rule-based machine learning method for discovering interesting relations between variables in large databases.
In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence.
34. Cont.,
Support - an indication of how frequently the itemset appears in the dataset.
Confidence - an indication of how often the rule has been found to be true.
Example rule for a supermarket: {bread, butter} => {milk}, meaning that if butter and bread are bought, customers also buy milk. The sketch below computes both measures for this rule.
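A minimal Python sketch computing support and confidence for the {bread, butter} => {milk} rule over a toy basket list (the transactions are invented):

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk", "eggs"},
]

antecedent, consequent = {"bread", "butter"}, {"milk"}
both = antecedent | consequent

n_antecedent = sum(antecedent <= t for t in transactions)   # baskets with bread+butter
n_both = sum(both <= t for t in transactions)               # baskets with all three

support = n_both / len(transactions)        # how frequently the itemset appears
confidence = n_both / n_antecedent          # how often the rule holds
print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.40, 0.67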
35. Algorithms for association rule learning
Some of the familiar algorithms used for mining frequent itemsets are:
1. Apriori algorithm - it uses
a) a breadth-first search strategy to count the support of itemsets, and
b) a candidate generation function that exploits the downward closure property of support (sketched below).
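A compact, hedged sketch of the Apriori idea: count support level by level (breadth-first) and generate size-(k+1) candidates only from frequent size-k itemsets, which is the downward closure property in action:

from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}   # level 1: single items
    frequent = {}
    k = 1
    while level:
        # Breadth-first: count support of all size-k candidates in one pass.
        counts = {c: sum(c <= t for t in transactions) for c in level}
        survivors = {c: v for c, v in counts.items() if v / n >= min_support}
        frequent.update(survivors)
        # Candidate generation: unions of frequent k-itemsets of size k+1.
        # Downward closure: prune any candidate with an infrequent k-subset.
        keys = list(survivors)
        level = {
            a | b
            for a, b in combinations(keys, 2)
            if len(a | b) == k + 1
            and all(frozenset(s) in survivors for s in combinations(a | b, k))
        }
        k += 1
    return frequent

baskets = [{"bread", "butter", "milk"}, {"bread", "butter"},
           {"bread", "milk"}, {"butter", "milk"}]
print(apriori(baskets, min_support=0.5))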
36. Equivalence class transformation (ECLAT) algorithm
A depth-first search algorithm using set intersection.
Suitable for serial and parallel execution, with locality-enhancing properties.
37. Frequent Pattern (FP) Growth algorithm
1st phase - the algorithm counts the number of occurrences of items in the dataset and stores them in a header table.
2nd phase - the FP-tree structure is built by inserting instances. Items in each instance have to be sorted in descending order of their frequency in the dataset, so that the tree can be processed quickly (phase 1 is sketched below).
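A hedged sketch of just the first phase described above, using the standard-library Counter: count item frequencies, build the header table, and re-sort each transaction by descending frequency, ready for FP-tree insertion (the baskets are invented):

from collections import Counter

transactions = [["bread", "milk"], ["bread", "butter", "milk"],
                ["butter", "milk"], ["bread", "butter"]]

# Phase 1: count occurrences of every item -> the header table.
header = Counter(item for t in transactions for item in t)
print("header table:", header)

# Sort each transaction by descending global frequency, so that when the
# FP-tree is built in phase 2, common prefixes are shared and the tree stays compact.
ordered = [sorted(t, key=lambda i: (-header[i], i)) for t in transactions]
print("ordered transactions:", ordered)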
38. Classification Tree Analysis
A type of machine learning algorithm used to classify the class of an object.
It identifies the set of characteristics that best differentiates individuals based on a categorical outcome variable (see the sketch below).
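A minimal classification-tree sketch, assuming scikit-learn is available (the two-feature toy dataset of [age, income] is invented):

from sklearn.tree import DecisionTreeClassifier

# Toy data: [age, income] -> whether the customer bought the product (0/1).
X = [[25, 30], [35, 60], [45, 80], [22, 20], [50, 90], [30, 40]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The fitted tree has learned which feature thresholds best separate the classes.
print(tree.predict([[40, 75]]))   # predicted class for a new individual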
39. Genetic Algorithms
A search-based optimization technique based on the concepts of natural selection and genetics.
In GAs, we have a pool or a population of possible solutions to the given problem. These solutions then undergo recombination and mutation (as in natural genetics), producing new children, and the process is repeated over various generations.
40. Cont.,
Each individual is assigned a fitness value (based on its objective function value), and the fitter individuals are given a higher chance to mate and yield more "fit" individuals.
GAs are part of the family of evolutionary algorithms.
Three basic operators of GA: (i) reproduction, (ii) mutation, and (iii) crossover - all three appear in the sketch below.
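A compact, hedged GA sketch showing all three operators on a toy problem (maximize the number of 1-bits in a string; every parameter here is an arbitrary choice for illustration):

import random

random.seed(0)
LENGTH, POP, GENS = 20, 30, 40

def fitness(bits):
    return sum(bits)                      # objective: count of 1-bits

def reproduce(pop):
    """Selection: fitter individuals are more likely to become parents."""
    return max(random.sample(pop, 3), key=fitness)   # tournament of 3

def crossover(a, b):
    cut = random.randrange(1, LENGTH)     # single-point crossover
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.05):
    return [1 - b if random.random() < rate else b for b in bits]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    population = [mutate(crossover(reproduce(population), reproduce(population)))
                  for _ in range(POP)]

best = max(population, key=fitness)
print("best fitness:", fitness(best), "of", LENGTH)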
41. Machine Learning
A method of data analysis that automates analytical model building.
It is an application of Artificial Intelligence based on the idea that machines should be able to learn and adapt through experience.
Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction.
42. Cont.,
• Machine learning is a branch of science that deals with programming systems in such a way that they automatically learn and improve with experience.
• Learning means recognizing and understanding the input data and making wise decisions based on the supplied data.
43. Cont.,
• It is very difficult to cater to all the decisions based on all possible inputs. To tackle this problem, algorithms are developed that build knowledge from specific data and past experience using the principles of statistics, probability theory, logic, combinatorial optimization, search, reinforcement learning, and control theory.
44. Learning types
There are several ways to implement machine learning techniques; the most commonly used are:
Supervised learning
Unsupervised learning
Semi-supervised learning
45. Supervised learning
• Deals with learning a function from available training data, with known input and output variables. An algorithm is used to learn the mapping function from input to output [Y = f(X)].
• It analyzes the training data and produces an inferred function that can be used for mapping new examples.
• Some supervised learning algorithms are neural networks, Support Vector Machines (SVMs), Naive Bayes classifiers, random forests, decision trees, and regression.
• Ex: classifying spam, voice recognition, regression (a minimal sketch follows).
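A minimal supervised-learning sketch, assuming scikit-learn is available, using one of the listed algorithms (a Naive Bayes classifier) on an invented spam-style dataset of [word_count, link_count] features:

from sklearn.naive_bayes import GaussianNB

# Labeled training data: [word_count, link_count] -> 1 = spam, 0 = not spam.
X_train = [[120, 0], [30, 5], [200, 1], [15, 8], [90, 0], [25, 6]]
y_train = [0, 1, 0, 1, 0, 1]

model = GaussianNB()
model.fit(X_train, y_train)               # learn the mapping Y = f(X)

print(model.predict([[40, 4], [150, 0]]))  # classify new, unseen examples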
46. Unsupervised Learning
Makes sense of unlabeled data without having any predefined dataset for its training: only inputs (X) and no corresponding output variable.
It models the underlying structure or distribution of the data in order to learn more about it.
Most commonly used for clustering similar inputs into logical groups.
Common approaches: k-means, self-organizing maps, and hierarchical clustering (k-means is sketched below).
Techniques: recommendation, association, clustering.
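A minimal unsupervised-learning sketch, assuming scikit-learn is available, clustering unlabeled 2-D points with k-means (the points are invented):

from sklearn.cluster import KMeans

# Unlabeled data: no output variable, only inputs X.
X = [[1, 2], [1, 4], [2, 3],      # one natural group
     [9, 8], [10, 9], [8, 10]]    # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)    # the algorithm discovers the groups itself

print("cluster labels:", labels)
print("centers:", kmeans.cluster_centers_)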
47. Semi-Supervised Learning
Addresses problems where you have a large amount of input data (X) and only some of it is labeled.
Example: a photo archive where only some of the images are labeled and the majority are unlabeled.
48. Regression Analysis
• A set of statistical processes for estimating the relationships among variables.
• Regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.
• Widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning.
49. Cont.,
• This technique is used for forecasting, time series modeling, and finding the causal-effect relationship between variables. For example, the relationship between rash driving and the number of road accidents by a driver is best studied through regression (a sketch follows).
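A minimal regression sketch, assuming scikit-learn is available, fitting the slide's own example of rash-driving incidents versus accidents (all numbers are invented):

from sklearn.linear_model import LinearRegression

# Independent variable: rash-driving incidents per month (one feature per row).
X = [[0], [1], [2], [3], [4], [5]]
# Dependent variable: road accidents recorded for those drivers.
y = [0, 1, 1, 2, 3, 3]

model = LinearRegression().fit(X, y)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted accidents at 6 incidents:", model.predict([[6]])[0])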
50. Sentiment Analysis/ Opinion Mining
Using NLP, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit.
Sentiment = feelings: attitudes, emotions, opinions.
Subjective impressions, not facts.
51. *A common use case for this technology is to discover how people feel about a particular topic.
Automated extraction of subjective content from digital text, predicting its subjectivity as positive, negative, or neutral (a toy sketch follows).
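A toy, hedged sketch of lexicon-based sentiment scoring in pure Python (the word lists are tiny and invented; real systems use large lexicons or trained models):

POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "awful"}

def sentiment(text):
    """Classify a text unit as positive, negative, or neutral by word counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this product it is excellent"))   # positive
print(sentiment("terrible service and poor quality"))     # negative
print(sentiment("the parcel arrived on Tuesday"))         # neutral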
52. Social Network Analysis
• The process of investigating social structures through the use of networks and graph theory.
• It is the mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other connected information/knowledge entities.
• The nodes in the network are the people and groups, while the links show relationships or flows between the nodes (see the sketch below).
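A minimal SNA sketch, assuming the networkx library is available, building a tiny people-network and measuring who is most connected (the names are invented):

import networkx as nx

# Nodes are people; edges are relationships or flows between them.
G = nx.Graph()
G.add_edges_from([
    ("Asha", "Ravi"), ("Asha", "Mina"), ("Asha", "John"),
    ("Ravi", "Mina"), ("John", "Lee"),
])

# Degree centrality: the fraction of the network each person is directly tied to.
centrality = nx.degree_centrality(G)
for person, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{person}: {score:.2f}")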
53. Two types of SNA
• Egocentric analysis
– Focuses on the individual: studies an individual's personal network and its effects on that individual.
• Sociocentric analysis
– Focuses on large groups of people: quantifies relationships between people in a group.
– Studies patterns of interactions and how these patterns affect the group as a whole.
54. Egocentric Analysis
• Examines local network structure.
• Describes the network around a single node (the ego):
– the number of other nodes (alters)
– the types of connections
• Extracts network features.
• Uses these factors to predict health and longevity, economic success, levels of depression, and access to new opportunities.
55. Sociocentric Analysis
• Quantifies relationships and interactions between a group of people.
• Studies how interactions, patterns of interactions, and network structure affect:
– concentration of power and resources
– spread of disease
– access to new ideas
– group dynamics
56. Big data analytics tools and technologies
Hadoop = HDFS + Map Reduce
Hive
HBase
Flume
Oozie
Pig
Sqoop
Kafka
Storm
RHadoop
Chukwa
57. Future role of data
Now: data feeds a Decision Support System.
Future: data feeds a Digital Nervous System (DNS), a loop in which the system senses, interprets, decides, and acts.
58. History of Hadoop
1996-2000: the big data problem is faced by all search engines, including Yahoo.
2003-04: Google publishes its Google File System and MapReduce papers.
2005-06: Hadoop spawns at Apache (Doug Cutting & Mike Cafarella).
2010: Cloudera.
2013: next-generation Hadoop with YARN and MapReduce 2.
59. Hadoop
An open source framework used for distributed storage and processing of big data datasets using the MapReduce programming model.
• The core components are: i) Hadoop Common - contains the libraries and utilities needed by the other Hadoop modules;
60. • Hadoop Distributed File System (HDFS) - stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
• Hadoop YARN - a platform responsible for managing computing resources in clusters and using them for scheduling users' applications;
• Hadoop MapReduce - an implementation of the MapReduce programming model for large-scale data processing.
61. Distributed Computing
The use of commodity hardware and open source software (scaling out by increasing the number of processors) against expensive proprietary software on expensive hardware (a single large server).
62. Major Components of Hadoop Framework
1. HDFS (Hadoop Distributed File System): inspired by the Google File System
2. Map Reduce: inspired by Google MapReduce
* Both work on a cluster of systems in a hierarchical architecture, with MapReduce running on top of HDFS.
64. Master Node: monitors the data distributed among the data nodes.
Data Node: stores the data blocks.
* Both are Hadoop daemons - in practice, Java programs running on specific machines.
65. Map Reduce
It is divided into 2 phases.
1. Map - the mapper code is distributed among machines, and each works on the data its own system holds (data locality). The locally computed results are aggregated and sent to the reducer.
[Map, Map, Map -> Reduce]
66. 2. Reduce - the reducer algorithm is applied to the aggregated global data to produce the final result.
Programmers need to write only the map logic and the reduce logic; the correct distribution of map code to the map machines is handled by Hadoop (a toy simulation follows).
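A toy, hedged simulation of the two phases in plain Python (no Hadoop cluster involved): the programmer supplies only the map and reduce logic, mirroring the classic word-count example:

from collections import defaultdict

documents = ["big data big insights", "data moves fast", "big clusters"]

# Map phase: each mapper emits (key, value) pairs from its local chunk.
def map_fn(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group all emitted values by key (Hadoop does this between the phases).
grouped = defaultdict(list)
for doc in documents:                 # pretend each doc lives on a different node
    for key, value in map_fn(doc):
        grouped[key].append(value)

# Reduce phase: the reducer combines all values for each key.
def reduce_fn(key, values):
    return key, sum(values)

print(dict(reduce_fn(k, v) for k, v in grouped.items()))
# {'big': 3, 'data': 2, 'insights': 1, 'moves': 1, 'fast': 1, 'clusters': 1}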
68. Pig
A tool that uses scripting statements to process data.
A simple data-flow language that saves development time and effort.
It was designed for data scientists who have fewer programming skills.
It was developed by Yahoo.
69. Hive
Provides an SQL-like language tool that runs on top of MapReduce.
Hive was developed by Facebook for data scientists who have fewer programming skills.
The code written in Pig/Hive gets converted into MapReduce jobs and runs over HDFS.
70. Sqoop/ Flume
In order to facilitate the movement of data into Hadoop, Sqoop and Flume are used.
Sqoop is used to move data from relational databases, and Flume is used to ingest data as it is created by external sources.
HBase is a tool that provides real-time database features on top of data stored in HDFS.