SlideShare a Scribd company logo
BIG DATA
BY: ZEESHAN ALAM KHAN(MCA, AMU)
Big Data: A definition
• Big data is a collection of data sets so large and complex
that it becomes difficult to process using on-hand
database management tools. The challenges include
capture, curation, storage, search, sharing, analysis, and
visualization. The trend to larger data sets is due to the
additional information derivable from analysis of a single
large set of related data, as compared to separate smaller
sets with the same total amount of data, allowing
correlations to be found to "spot business trends,
determine quality of research, prevent diseases, link legal
citations, combat crime, and determine real-time roadway
traffic conditions. (Wikipedia)
Big Data: A definition
• Put another way, big data is the realization of greater
business intelligence by storing, processing, and
analyzing data that was previously ignored due to the
limitations of traditional data management
technologies
Source: Harness the Power of Big Data: The IBM Big Data Platform
Lots of data
• 2.5 quintillion bytes of data are generated every day!
– A quintillion is 1018
• Data come from many quarters.
– Social media sites
– Sensors
– Digital photos
– Business transactions
– Location-based data
Source: IBM http://www-01.ibm.com/software/data/bigdata/
The four dimensions of Big Data
• Volume: Large volumes of data
• Velocity: Quickly moving data
• Variety: structured, unstructured, images, etc.
• Veracity: Trust and integrity is a challenge and a must
and is important for big data just as for traditional
relational DBs
Source: IBM http://www-01.ibm.com/software/data/bigdata/
The four dimensions of use
• Aspects of the way in which users want to interact
with their data…
– Totality: Users have an increased desire to process and
analyze all available data
– Exploration: Users apply analytic approaches where the
schema is defined in response to the nature of the query
– Frequency: Users have a desire to increase the rate of
analysis in order to generate more accurate and timely
business intelligence
– Dependency: Users’ need to balance investment in existing
technologies and skills with the adoption of new techniques
Source: IBM http://www-01.ibm.com/software/data/bigdata/
So, in a nutshell
• Big Data is about better analytics!
Why Big Data and BI
Source: Business Intelligence Strategy: A Framework for
Achieving BI Excellence
Source: Business Intelligence Strategy: A Framework for
Achieving BI Excellence
Big Data Conundrum
• Problems:
– Although there is a massive spike available data, the
percentage of the data that an enterprise can understand is
on the decline
– The data that the enterprise is trying to understand is
saturated with both useful signals and lots of noise
Source: IBM http://www-01.ibm.com/software/data/bigdata/
The Big Data platform Manifesto
imperatives and underlying technologies
IBM’s Big Data Platform
Some concepts
• NoSQL (Not Only SQL): Databases that “move
beyond” relational data models (i.e., no tables, limited
or no use of SQL)
– Focus on retrieval of data and appending new data (not
necessarily tables)
– Focus on key-value data stores that can be used to locate
data objects
– Focus on supporting storage of large quantities of
unstructured data
– SQL is not used for storage or retrieval of data
– No ACID (atomicity, consistency, isolation, durability)
NoSQL
• NoSQL focuses on a schema-less architecture (i.e.,
the data structure is not predefined)
• In contrast, traditional relation DBs require the
schema to be defined before the database is built and
populated.
– Data are structured
– Limited in scope
– Designed around ACID principles.
Hadoop
• Hadoop is a distributed file system and data processing
engine that is designed to handle extremely high volumes
of data in any structure.
• Hadoop has two components:
– The Hadoop distributed file system (HDFS), which supports data
in structured relational form, in unstructured form, and in any
form in between
– The MapReduce programing paradigm for managing
applications on multiple distributed servers
• The focus is on supporting redundancy, distributed
architectures, and parallel processing
Some Hadoop Related
Names to Know
• Apache Avro: designed for communication between
Hadoop nodes through data serialization
• Cassandra and Hbase: a non-relational database designed
for use with Hadoop
• Hive: a query language similar to SQL (HiveQL) but
compatible with Hadoop
• Mahout: an AI tool designed for machine learning; that is,
to assist with filtering data for analysis and exploration
• Pig Latin: A data-flow language and execution framework
for parallel computation
• ZooKeeper: Keeps all the parts coordinated and working
together
What to do with the data
Parallels with Data Warehousing
Data Warehouses
• Extraction
• Transformation
• Load
• Connector
• Processing
• User Management
Connector Framework
• Supports access to data by creating indexes that can
be used for access to the data in its native repository
(i.e., it does not manage the data, it keeps track of
where it is located)
Processing Layer
• Two primary functions:
– Indexes content: data are crawled, parsed, and analyzed
with the result that contents are indexed and located
• Processes queries
– Manages access to various servers hosting the indexed and
searchable content
Annotated Query Language
• AQL is an SQL-like declarative language for
performing text analysis and extraction
create view PersonPhone as select P.name as person, N.number as phone
from Person P, Phone PN, Sentence S where Follows(P.name. PN.number, 0, 30)
and Contains(S.sentence, P.name) and Contains(S.sentence, PN.number)
and ContainsRegex(/b(phone|at)b/, SpanBetween(P.name, PN.number));
The provenance viewer
Machine data analysis
Some resources
• BigInsights Wiki
• Information Management Bookstore
• BigData University

More Related Content

What's hot

Prcn 2019 stage 1264-question-presentation_poster file_id-15
Prcn 2019 stage 1264-question-presentation_poster file_id-15Prcn 2019 stage 1264-question-presentation_poster file_id-15
Prcn 2019 stage 1264-question-presentation_poster file_id-15
madynav
 

What's hot (20)

Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop
 
Data Warehouse and Data Mining
Data Warehouse and Data MiningData Warehouse and Data Mining
Data Warehouse and Data Mining
 
Big Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on ReadBig Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on Read
 
Data Warehousing and Mining
Data Warehousing and MiningData Warehousing and Mining
Data Warehousing and Mining
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practices
 
Data as a service
Data as a serviceData as a service
Data as a service
 
Cortana Analytics Workshop: Azure Data Catalog
Cortana Analytics Workshop: Azure Data CatalogCortana Analytics Workshop: Azure Data Catalog
Cortana Analytics Workshop: Azure Data Catalog
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
 
Global IT Outsourcing case study
Global IT Outsourcing case studyGlobal IT Outsourcing case study
Global IT Outsourcing case study
 
Prcn 2019 stage 1264-question-presentation_poster file_id-15
Prcn 2019 stage 1264-question-presentation_poster file_id-15Prcn 2019 stage 1264-question-presentation_poster file_id-15
Prcn 2019 stage 1264-question-presentation_poster file_id-15
 
AzureDay - Introduction Big Data Analytics.
AzureDay  - Introduction Big Data Analytics.AzureDay  - Introduction Big Data Analytics.
AzureDay - Introduction Big Data Analytics.
 
introduction to data warehousing and mining
 introduction to data warehousing and mining introduction to data warehousing and mining
introduction to data warehousing and mining
 
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
Data Catalog in Denodo Platform 7.0: Creating a Data Marketplace with Data Vi...
 
DATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MININGDATA WAREHOUSING AND DATA MINING
DATA WAREHOUSING AND DATA MINING
 
Digital intelligence satish bhatia
Digital intelligence satish bhatiaDigital intelligence satish bhatia
Digital intelligence satish bhatia
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
 
3 Ways Tableau Improves Predictive Analytics
3 Ways Tableau Improves Predictive Analytics3 Ways Tableau Improves Predictive Analytics
3 Ways Tableau Improves Predictive Analytics
 

Similar to Introduction to BIG DATA

How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
Moacyr Passador
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 

Similar to Introduction to BIG DATA (20)

Big Data
Big DataBig Data
Big Data
 
SMAC - Social, Mobile, Analytics and Cloud - An overview
SMAC - Social, Mobile, Analytics and Cloud - An overview SMAC - Social, Mobile, Analytics and Cloud - An overview
SMAC - Social, Mobile, Analytics and Cloud - An overview
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
 
Eclipse day Sydney 2014 BIG data presentation
Eclipse day Sydney 2014 BIG data presentationEclipse day Sydney 2014 BIG data presentation
Eclipse day Sydney 2014 BIG data presentation
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help business
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
bigdataintro.pptx
bigdataintro.pptxbigdataintro.pptx
bigdataintro.pptx
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
 
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenariosThe Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
The Practice of Big Data - The Hadoop ecosystem explained with usage scenarios
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
 
Big data peresintaion
Big data peresintaion Big data peresintaion
Big data peresintaion
 
Big data analytics - hadoop
Big data analytics - hadoopBig data analytics - hadoop
Big data analytics - hadoop
 
Overview of Bigdata Analytics
Overview of Bigdata Analytics Overview of Bigdata Analytics
Overview of Bigdata Analytics
 
IARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptxIARE_BDBA_ PPT_0.pptx
IARE_BDBA_ PPT_0.pptx
 
Oh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG DataOh! Session on Introduction to BIG Data
Oh! Session on Introduction to BIG Data
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
Big Data
Big DataBig Data
Big Data
 

More from Zeeshan Khan

More from Zeeshan Khan (12)

Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Spring security4.x
Spring security4.xSpring security4.x
Spring security4.x
 
Micro services overview
Micro services overviewMicro services overview
Micro services overview
 
XML / WEB SERVICES & RESTful Services
XML / WEB SERVICES & RESTful ServicesXML / WEB SERVICES & RESTful Services
XML / WEB SERVICES & RESTful Services
 
Manual Testing
Manual TestingManual Testing
Manual Testing
 
Collection framework (completenotes) zeeshan
Collection framework (completenotes) zeeshanCollection framework (completenotes) zeeshan
Collection framework (completenotes) zeeshan
 
JUnit with_mocking
JUnit with_mockingJUnit with_mocking
JUnit with_mocking
 
OOPS in Java
OOPS in JavaOOPS in Java
OOPS in Java
 
Java
JavaJava
Java
 
Big data
Big dataBig data
Big data
 
Android application development
Android application developmentAndroid application development
Android application development
 
Cyber crime
Cyber crimeCyber crime
Cyber crime
 

Recently uploaded

一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理
一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理
一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理
aagad
 
audience research (emma) 1.pptxkkkkkkkkkkkkkkkkk
audience research (emma) 1.pptxkkkkkkkkkkkkkkkkkaudience research (emma) 1.pptxkkkkkkkkkkkkkkkkk
audience research (emma) 1.pptxkkkkkkkkkkkkkkkkk
lolsDocherty
 
Article writing on excessive use of internet.pptx
Article writing on excessive use of internet.pptxArticle writing on excessive use of internet.pptx
Article writing on excessive use of internet.pptx
abhinandnam9997
 

Recently uploaded (13)

The Best AI Powered Software - Intellivid AI Studio
The Best AI Powered Software - Intellivid AI StudioThe Best AI Powered Software - Intellivid AI Studio
The Best AI Powered Software - Intellivid AI Studio
 
Pvtaan Social media marketing proposal.pdf
Pvtaan Social media marketing proposal.pdfPvtaan Social media marketing proposal.pdf
Pvtaan Social media marketing proposal.pdf
 
一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理
一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理
一比一原版UTS毕业证悉尼科技大学毕业证成绩单如何办理
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
 
Bug Bounty Blueprint : A Beginner's Guide
Bug Bounty Blueprint : A Beginner's GuideBug Bounty Blueprint : A Beginner's Guide
Bug Bounty Blueprint : A Beginner's Guide
 
ER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAEER(Entity Relationship) Diagram for online shopping - TAE
ER(Entity Relationship) Diagram for online shopping - TAE
 
How Do I Begin the Linksys Velop Setup Process?
How Do I Begin the Linksys Velop Setup Process?How Do I Begin the Linksys Velop Setup Process?
How Do I Begin the Linksys Velop Setup Process?
 
audience research (emma) 1.pptxkkkkkkkkkkkkkkkkk
audience research (emma) 1.pptxkkkkkkkkkkkkkkkkkaudience research (emma) 1.pptxkkkkkkkkkkkkkkkkk
audience research (emma) 1.pptxkkkkkkkkkkkkkkkkk
 
The AI Powered Organization-Intro to AI-LAN.pdf
The AI Powered Organization-Intro to AI-LAN.pdfThe AI Powered Organization-Intro to AI-LAN.pdf
The AI Powered Organization-Intro to AI-LAN.pdf
 
The Use of AI in Indonesia Election 2024: A Case Study
The Use of AI in Indonesia Election 2024: A Case StudyThe Use of AI in Indonesia Election 2024: A Case Study
The Use of AI in Indonesia Election 2024: A Case Study
 
Case study on merger of Vodafone and Idea (VI).pptx
Case study on merger of Vodafone and Idea (VI).pptxCase study on merger of Vodafone and Idea (VI).pptx
Case study on merger of Vodafone and Idea (VI).pptx
 
Article writing on excessive use of internet.pptx
Article writing on excessive use of internet.pptxArticle writing on excessive use of internet.pptx
Article writing on excessive use of internet.pptx
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
 

Introduction to BIG DATA

  • 1. BIG DATA BY: ZEESHAN ALAM KHAN(MCA, AMU)
  • 2. Big Data: A definition • Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools. The challenges include capture, curation, storage, search, sharing, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions. (Wikipedia)
  • 3. Big Data: A definition • Put another way, big data is the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored due to the limitations of traditional data management technologies Source: Harness the Power of Big Data: The IBM Big Data Platform
  • 4. Lots of data • 2.5 quintillion bytes of data are generated every day! – A quintillion is 1018 • Data come from many quarters. – Social media sites – Sensors – Digital photos – Business transactions – Location-based data Source: IBM http://www-01.ibm.com/software/data/bigdata/
  • 5. The four dimensions of Big Data • Volume: Large volumes of data • Velocity: Quickly moving data • Variety: structured, unstructured, images, etc. • Veracity: Trust and integrity is a challenge and a must and is important for big data just as for traditional relational DBs Source: IBM http://www-01.ibm.com/software/data/bigdata/
  • 6. The four dimensions of use • Aspects of the way in which users want to interact with their data… – Totality: Users have an increased desire to process and analyze all available data – Exploration: Users apply analytic approaches where the schema is defined in response to the nature of the query – Frequency: Users have a desire to increase the rate of analysis in order to generate more accurate and timely business intelligence – Dependency: Users’ need to balance investment in existing technologies and skills with the adoption of new techniques Source: IBM http://www-01.ibm.com/software/data/bigdata/
  • 7. So, in a nutshell • Big Data is about better analytics!
  • 8. Why Big Data and BI Source: Business Intelligence Strategy: A Framework for Achieving BI Excellence
  • 9. Source: Business Intelligence Strategy: A Framework for Achieving BI Excellence
  • 10. Big Data Conundrum • Problems: – Although there is a massive spike available data, the percentage of the data that an enterprise can understand is on the decline – The data that the enterprise is trying to understand is saturated with both useful signals and lots of noise Source: IBM http://www-01.ibm.com/software/data/bigdata/
  • 11. The Big Data platform Manifesto imperatives and underlying technologies
  • 12. IBM’s Big Data Platform
  • 13. Some concepts • NoSQL (Not Only SQL): Databases that “move beyond” relational data models (i.e., no tables, limited or no use of SQL) – Focus on retrieval of data and appending new data (not necessarily tables) – Focus on key-value data stores that can be used to locate data objects – Focus on supporting storage of large quantities of unstructured data – SQL is not used for storage or retrieval of data – No ACID (atomicity, consistency, isolation, durability)
  • 14. NoSQL • NoSQL focuses on a schema-less architecture (i.e., the data structure is not predefined) • In contrast, traditional relation DBs require the schema to be defined before the database is built and populated. – Data are structured – Limited in scope – Designed around ACID principles.
  • 15. Hadoop • Hadoop is a distributed file system and data processing engine that is designed to handle extremely high volumes of data in any structure. • Hadoop has two components: – The Hadoop distributed file system (HDFS), which supports data in structured relational form, in unstructured form, and in any form in between – The MapReduce programing paradigm for managing applications on multiple distributed servers • The focus is on supporting redundancy, distributed architectures, and parallel processing
  • 16. Some Hadoop Related Names to Know • Apache Avro: designed for communication between Hadoop nodes through data serialization • Cassandra and Hbase: a non-relational database designed for use with Hadoop • Hive: a query language similar to SQL (HiveQL) but compatible with Hadoop • Mahout: an AI tool designed for machine learning; that is, to assist with filtering data for analysis and exploration • Pig Latin: A data-flow language and execution framework for parallel computation • ZooKeeper: Keeps all the parts coordinated and working together
  • 17. What to do with the data
  • 18. Parallels with Data Warehousing Data Warehouses • Extraction • Transformation • Load • Connector • Processing • User Management
  • 19. Connector Framework • Supports access to data by creating indexes that can be used for access to the data in its native repository (i.e., it does not manage the data, it keeps track of where it is located)
  • 20. Processing Layer • Two primary functions: – Indexes content: data are crawled, parsed, and analyzed with the result that contents are indexed and located • Processes queries – Manages access to various servers hosting the indexed and searchable content
  • 21. Annotated Query Language • AQL is an SQL-like declarative language for performing text analysis and extraction create view PersonPhone as select P.name as person, N.number as phone from Person P, Phone PN, Sentence S where Follows(P.name. PN.number, 0, 30) and Contains(S.sentence, P.name) and Contains(S.sentence, PN.number) and ContainsRegex(/b(phone|at)b/, SpanBetween(P.name, PN.number));
  • 22.
  • 25.
  • 26.
  • 27.
  • 28. Some resources • BigInsights Wiki • Information Management Bookstore • BigData University