Seminar Topic
Big Data Technologies
An introduction to Big Data
Presentation Agenda
• What is Big Data
• Who is it for
• Why Big Data
• How to get started, and When
• Objectives and Benefits
What is Big Data ?
Large-Scale Data Management
Data Science and Analytics
Managing very large amounts of data and extracting value from it !
Data is the New Gold – Data Mining
Big Data - No single standard definition
“Big Data” is the data whose scale, diversity, and complexity require new
architecture, techniques, algorithms, and analytics to manage it and extract
value and hidden knowledge from it…
Examples : Google, Yahoo, Facebook, eBay, Amazon and many enterprises…
As per Wikipedia…
“Big data is an all-encompassing term for any collection of data sets so large
and complex that it becomes difficult to process using on-hand data
management tools or traditional data processing applications.”
What makes data, “Big” Data?
Data always existed; we made it Big.
Data generation
• Web data, e-commerce
• Purchases at department and grocery stores
• Bank/Credit Card transactions
• Social Networks
• Health care records
• Satellite imagery and weather modeling
• Many sources of Data
• Available in Public domain
Data explosion
Data Approximation
• Google processes 20 PB a day (2008)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
▫ 44x increase from 2009 to 2020
▫ From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Exponential increase in
collected/generated data
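As a rough check on the figures above, the overall multiple and the yearly growth it implies can be worked out directly. This is only a sketch that assumes smooth compound growth between the two quoted endpoints; the variable names are illustrative.

```python
import math

# Figures quoted on the slide: 0.8 ZB in 2009, 35 ZB projected for 2020.
START_ZB, END_ZB = 0.8, 35.0
YEARS = 2020 - 2009

growth_factor = END_ZB / START_ZB                 # overall multiple (the "44x")
annual_rate = growth_factor ** (1 / YEARS) - 1    # compound annual growth rate
doubling_years = math.log(2) / math.log(1 + annual_rate)

print(f"Overall growth: {growth_factor:.1f}x")
print(f"Implied annual growth: {annual_rate:.0%}")
print(f"Implied doubling time: {doubling_years:.1f} years")
```

Under that assumption the data volume grows by roughly 40% a year, which means it doubles about every two years.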
Characteristics of Big Data:
2-Complexity (Variety)
• Various formats, types, and
structures
• Text, numerical, images, audio,
video, sequences, time series,
social media data, multi-dim
arrays, etc…
• Static data vs. streaming data
• A single application can be
generating/collecting many types
of data
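The difference between static and streaming data can be illustrated with a small sketch. It assumes a hypothetical list of sensor readings: the static version needs the whole dataset in memory, while the streaming version updates its answer one record at a time.

```python
from typing import Iterable, Iterator

def batch_mean(values: list[float]) -> float:
    """Static data: the whole dataset is available, so one pass over a list works."""
    return sum(values) / len(values)

def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Streaming data: update the mean per record, keeping only two numbers in memory."""
    count, mean = 0, 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count   # incremental update, no history stored
        yield mean

readings = [12.0, 15.0, 11.0, 14.0]   # hypothetical sensor readings
print(batch_mean(readings))
print(list(running_mean(iter(readings))))
```

The last value produced by the running version matches the batch result, but it is available as soon as each record arrives.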
Characteristics of Big Data:
3-Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online Data Analytics
• Late decisions → missed opportunities
Big Data: 3V’s
Some Make it 4V’s
Who is Big Data for ?
The Four Pillars of an Organization
Harnessing Big Data
• OLTP: Online Transaction Processing (RDBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Solutions)
The Model Has Changed
• The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
Modern Data Types
Challenges in Handling Big Data
• The Bottleneck
▫ New architecture, algorithms, techniques, and solutions are needed
• Need for New talent with technical skills
▫ Experts in using the new technology and dealing with big data
• How can we process all that information?
• There are actually two problems
– Large scale "data storage"
– Large scale "data analysis"
Data Processing Scalability
• We are generating more data than ever before
• Fortunately, the size and cost of storage have kept pace
Capacity has increased while price has decreased
Year    Capacity (GB)    Cost per GB (USD)
1997    2.1              $157
2004    200              $1.05
2014    3,000            $0.036
Disk Capacity and Price
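To make the table concrete, a short sketch can multiply capacity by cost per GB to estimate the price of one drive in each year, and compare the first and last rows. The numbers are the ones quoted in the table; the helper name `drive_cost` is illustrative.

```python
# (year, capacity in GB, cost per GB in USD), as quoted in the table above
DISKS = [(1997, 2.1, 157.0), (2004, 200.0, 1.05), (2014, 3000.0, 0.036)]

def drive_cost(capacity_gb: float, cost_per_gb: float) -> float:
    """Approximate price of one drive of the given capacity."""
    return capacity_gb * cost_per_gb

for year, cap, cpg in DISKS:
    print(f"{year}: {cap:,.1f} GB at ${cpg}/GB -> about ${drive_cost(cap, cpg):,.0f} per drive")

price_drop = DISKS[0][2] / DISKS[-1][2]      # how many times cheaper one GB became
capacity_gain = DISKS[-1][1] / DISKS[0][1]   # how many times larger one drive became
print(f"Cost per GB fell by a factor of about {price_drop:,.0f}")
print(f"Capacity per drive grew by a factor of about {capacity_gain:,.0f}")
```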
Disk Capacity and Performance
• Disk performance has also increased in the last 15 years
• Unfortunately, transfer rates have not kept pace with capacity
Year    Capacity (GB)    Transfer Rate (MB/s)    Disk Read Time
1997    2.1              16.6                    126 seconds
2004    200              56.5                    59 minutes
2014    3,000            210                     3 hours, 58 minutes
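The read times in the table follow from dividing capacity by transfer rate. The sketch below reproduces them, assuming decimal units (1 GB = 1000 MB). The `disks` parameter is an added, hypothetical illustration of what happens when the same volume is split evenly over several disks read in parallel, which is the idea distributed platforms build on.

```python
def read_time_seconds(capacity_gb: float, rate_mb_per_s: float, disks: int = 1) -> float:
    """Time to scan the full capacity, optionally split evenly across disks read in parallel."""
    capacity_mb = capacity_gb * 1000      # decimal units: 1 GB = 1000 MB
    return capacity_mb / rate_mb_per_s / disks

for year, cap, rate in [(1997, 2.1, 16.6), (2004, 200, 56.5), (2014, 3000, 210)]:
    secs = read_time_seconds(cap, rate)
    print(f"{year}: {secs:,.0f} s (about {secs / 60:,.1f} min)")

# Hypothetical: the 2014 data volume spread over 100 such disks read at once.
print(f"100 disks: {read_time_seconds(3000, 210, disks=100):,.0f} s")
```

Capacity grew far faster than transfer rate, so a full scan of one disk went from about two minutes to almost four hours. Spreading the data over many disks brings the scan time back down.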
How does it work ?
Big Data Technologies in use today !
The Competition and Complexity are Immense…
Big Data and HADOOP Landscape
Tools in the Big Data Landscape
Some popular vendor enterprises
How to Get started with
Big Data, and When ?
Learning curve for Big Data ?
• Big Data involves not just several tools, but numerous technologies,
methodologies, and mathematical and/or statistical concepts
• These need to be thought through, developed, refined, and applied
appropriately to reach a certain goal
• Algorithms and computing languages are required to practically turn
“Big Data into Applied Intelligence”
• If a flexible mind starts learning concepts like HADOOP and related
topics now, it can rightly be positioned in a project after a few months
Getting started
• Several perceptions and perspectives
▫ Several tools exist for Big-Data technology
▫ Students need the right direction to get started
• Learn the Concepts and The Platform
▫ How big data is managed in a scalable and efficient way
▫ Hadoop can be a good starting point
• Big-Data courses available in the market
▫ Administrator Courses for Apache Hadoop
▫ Developer Courses for Apache Hadoop
• Data Analyst courses
▫ Statistical tools for managing big data
▫ High-Level Languages: Apache Pig, Hive
▫ Programming Languages: Java, C, Python, R etc.
• Data Science courses
▫ Mahout: Data mining and machine learning tools over big data
▫ Scientific Data Modeling
Starting with HADOOP
• Prerequisites of HADOOP platform for learning
• Virtual machine environment
• Any supported or popular Linux distribution
• RHEL, SUSE, CentOS, Ubuntu or Fedora
• Oracle Java JDK
• HADOOP platform
• Single-node and then clustered with High-Availability
• Cloudera Quickstart VM (CDH 4 or 5)
• Cloudera is one of the pioneers in Big Data technologies
• CDH or Cloudera Distribution for HADOOP available as a VM
• Downloadable from Cloudera website
• Other needed software packages
A Brief Introduction to HADOOP
• Sometimes expanded as “High Availability Distributed Object Oriented Platform”, although the name is not an official acronym
• Hadoop is the brainchild of Doug Cutting
• Hadoop was originally the name of his son’s toy elephant
• Based on Google’s published whitepapers
• Developed by The Apache Software Foundation (http://apache.org)
• Google started in the 1990s; the 2000s brought data management complexities
• In 2004, Google published a whitepaper on MapReduce, a framework that
provides a parallel processing model
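The MapReduce model can be sketched in a few lines of plain Python. This is an in-process illustration of the map, shuffle, and reduce steps on a word count, not Hadoop's actual API; the function names and the input strings are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data is big", "data is the new gold"]         # hypothetical input splits
mapped = chain.from_iterable(map_phase(s) for s in splits)   # each split could run on a different node
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each split is mapped independently and each key is reduced independently, both phases can be spread across many machines. That independence is what makes the model parallel.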
A Brief Introduction to HADOOP
• Google’s technologies namely
1. GFS (Google File System) – A distributed file system
2. MapReduce – A framework for parallel processing
3. BigTable – A Data storage system
• Open-source counterparts, based on the published papers, were developed
under the Apache Software Foundation:
1. HDFS (Hadoop Distributed File System)
2. MapReduce
3. Apache HBase
HADOOP - Business problems types
How does MapReduce help
Hadoop and MapReduce Architecture
Big Data Platform - Hardware and Software needed !
A Sample HADOOP Cluster Configuration
Companies using HADOOP
History of Databases
Relational Dominance
NoSQL = Not Only SQL
NoSQL Database Types
RDBMS vs. HADOOP
Objectives and Benefits of
Big Data
Some highlights of IT industry
• The term “IT” for Information technology first appeared in 1958
• IT has been a catalyst to areas of science and technology, and businesses
• A movement from IT driven industry to open information society
• Today we are part of a global village, via the internet, which is now a commodity
or a common consumer service to many people
• Rapidly changing technologies
• Fast pace of Innovations and Inventions
• One needs to keep pace with new developments
• Enterprise examples - Blackberry, WhatsApp
What’s driving the Big Data Market
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time in nature
Value of Big Data Analytics
• Big data is more real-time in nature
than traditional DW applications
• Traditional DW architectures (e.g.
Exadata, Teradata) are not well-
suited for big data apps
• Shared nothing, massively parallel
processing, scale out architectures
are well-suited for big data apps
Real-world scenarios
• Imagine your boss comes to you and says:
“Here are 50 GB of log files. Find a way to improve
our business!”
• What would you do?
• Where would you start?
• And what would you do next?
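One possible first step is a single streaming pass that keeps only small running summaries, so the 50 GB never has to fit in memory. The sketch below assumes a hypothetical log line format of `<ip> <path> <status> <millis>` and an arbitrary one-second threshold for a slow request; real log files will need their own pattern.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed log line shape (hypothetical): '<ip> <path> <status> <millis>'
LINE = re.compile(r"^(\S+) (\S+) (\d{3}) (\d+)$")

def summarize(lines: Iterable[str]) -> dict:
    """Stream over log lines once, keeping only small running aggregates."""
    status = Counter()
    slow_paths = Counter()
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue                       # skip lines that do not fit the assumed format
        _ip, path, code, millis = m.groups()
        status[code] += 1
        if int(millis) > 1000:             # arbitrary threshold for a "slow" request
            slow_paths[path] += 1
    return {"status": status, "slow_paths": slow_paths}

sample = [
    "10.0.0.1 /checkout 200 1830",
    "10.0.0.2 /search 200 120",
    "10.0.0.3 /checkout 500 2400",
    "malformed line",
]
report = summarize(sample)
print(report["status"].most_common())
print(report["slow_paths"].most_common(3))
```

With real files, an open file handle can be passed in place of the sample list, since iterating over a file yields its lines one at a time.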
Use Big Data to make better Business Decisions
Some Examples of Big Data Projects
• Consumer product companies monitoring social media like Facebook and
Twitter to get an unprecedented view into customer behavior, preferences,
and product perception.
• Governments are making data public at both the national and
international level for users to develop new applications.
• Sports teams are using data for tracking ticket sales and even for tracking
team strategies.
• Manufacturers are monitoring minute vibration data from their equipment,
to predict the optimal time for component replacement.
• Financial Services organizations are using data mined from customer
interactions to create increasingly relevant and sophisticated offers.
• Advertising and marketing agencies are tracking social media to understand
responsiveness to campaigns, promotions, and other advertising mediums.
• Web-based businesses are developing information products that combine
data gathered from customers to offer more appealing recommendations
and more successful coupon programs.
Cloud and Big Data
Most of the traditional IT skills are being moved towards the Cloud and Big
Data platform.
Some related fields:
• Artificial Intelligence
• Distributed computing / super computing
• Business Analytics / Business Intelligence
• Data Analytics / Data Mining
• Projects running on legacy IT infrastructure
The Google Public Data Explorer
Makes large datasets easy to explore, visualize and communicate.
Google Cloud Platform
Project Dashboard
Google BigQuery
Analyze big data in the cloud using Google's BigQuery.
Big Data Market
Thank You

Big-Data-Seminar-6-Aug-2014-Koenig

  • 3. Presentation Agenda • What is Big Data • Who is it for • Why Big Data • How to get started, and When • Objectives and Benefits
  • 4. What is Big Data ?
  • 5. Large-Scale Data Management Data Science and Analytics Managing very large amounts of data and extracting value from it ! Data is the New Gold – Data Mining
  • 6. Big Data - No single standard definition “Big Data” is the data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… Examples : Google, Yahoo, Facebook, eBay, Amazon and many enterprises… As per Wikipedia… “Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.” What makes data, “Big” Data? Data always existed; we made it Big.
  • 7. Data generation • Web data, e-commerce • Purchases at department and grocery stores • Bank/Credit Card transactions • Social Networks • Health care records • Satellite imagery and weather modeling • Many sources of Data • Available in Public domain
  • 9. Data Approximation • Google processes 20 PB a day (2008) • Facebook has 2.5 PB of user data + 15 TB/day (4/2009) • Wayback Machine has 3 PB + 100 TB/month (3/2009) • eBay has 6.5 PB of user data + 50 TB/day (5/2009) • CERN’s Large Hydron Collider (LHC) generates 15 PB a year
  • 10. Characteristics of Big Data: 1-Scale (Volume) • Data Volume ▫ 44x increase from 2009 to 2020 ▫ From 0.8 zettabytes to 35zb • Data volume is increasing exponentially Exponential increase in collected/generated data
  • 11. Characteristics of Big Data: 2-Complexity (Varity) • Various formats, types, and structures • Text, numerical, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc… • Static data vs. streaming data • A single application can be generating/collecting many types of data
  • 12. Characteristics of Big Data: 3-Speed (Velocity) • Data is being generated fast and need to be processed fast • Online Data Analytics • Late decisions  missing opportunities
  • 14. Some Make it 4V’s
  • 15. Who is Big Data for ?
  • 16. The Four Pillars of an Organization
  • 17. Harnessing Big Data • OLTP: Online Transaction Processing (RDBMSs) • OLAP: Online Analytical Processing (Data Warehousing) • RTAP: Real-Time Analytics Processing (Big Data Solutions)
  • 18. The Model Has Changed • The Model of Generating/Consuming Data has Changed Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data
  • 20. Challenges in Handling Big Data • The Bottleneck ▫ New architecture, algorithms, techniques, and solutions are needed • Need for New talent with technical skills ▫ Experts in using the new technology and dealing with big data
  • 21. • How can we process all that information? • There are actually two problems – Large scale "data storage" – Large scale "data analysis“ Data Processing Scalability • We are generating more data than ever before • Fortunately, the size and cost of storage has kept pace Capacity has increased while price has decreased Year Capacity (GB) Cost per GB (USD) 1997 2.1 $157 2004 200 $1.05 2014 3,000 $0.036 Disk Capacity and Price
  • 22. Disk Capacity and Performance • Disk performance has also increased in the last 15 years • Unfortunately, transfer rates have not kept pace with capacity:
      Year   Capacity (GB)   Transfer Rate (MB/s)   Disk Read Time
      1997   2.1             16.6                   126 seconds
      2004   200             56.5                   59 minutes
      2014   3,000           210                    3 hours, 58 minutes
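The disk read times on this slide are simply capacity divided by transfer rate. A minimal sketch of that calculation, assuming 1 GB = 1000 MB and a purely sequential read of the whole disk (both simplifications; the function names are hypothetical):

```python
# (year, capacity in GB, transfer rate in MB/s), taken from the slide's table.
DISKS = [
    (1997, 2.1, 16.6),
    (2004, 200.0, 56.5),
    (2014, 3000.0, 210.0),
]


def full_read_seconds(capacity_gb: float, rate_mb_s: float) -> float:
    """Seconds needed to read the entire disk sequentially (1 GB = 1000 MB)."""
    return capacity_gb * 1000.0 / rate_mb_s


def format_duration(seconds: float) -> str:
    """Render a duration compactly, truncating to whole seconds."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours} h {minutes} min"
    if minutes:
        return f"{minutes} min {secs} s"
    return f"{secs} s"


if __name__ == "__main__":
    for year, capacity_gb, rate_mb_s in DISKS:
        seconds = full_read_seconds(capacity_gb, rate_mb_s)
        print(year, format_duration(seconds))
```

Capacity grew by a factor of about 1,400 between 1997 and 2014, while the transfer rate grew by a factor of about 13, which is why reading a full disk went from minutes to hours. This gap is the motivation for spreading data across many disks and reading them in parallel.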
  • 23. How does it work ? Big Data Technologies in use today ! The Competition and Complexity are Immense…
  • 24. Big Data and HADOOP Landscape
  • 25. Tools in the Big Data Landscape
  • 26. Some popular vendor enterprises
  • 27. How to Get started with Big Data, and When ?
  • 28. Learning curve for Big Data ? • Big Data involves not just several tools, but numerous technologies, methodologies, and mathematical and/or statistical concepts • These need to be thought through, developed, refined, and applied appropriately to reach a given goal • Algorithms and computing languages are required to practically turn “Big Data into Applied Intelligence” • A flexible mind that starts learning concepts like HADOOP and related topics now can be well positioned to join a project after a few months
  • 29. Getting started • Several perceptions and perspectives ▫ Several tools exist for Big-Data technology ▫ Students need the right direction to get started • Learn the Concepts and The Platform ▫ How big data is managed in a scalable and efficient way ▫ Hadoop can be a good starting point • Big-Data courses available in the market ▫ Administrator Courses for Apache Hadoop ▫ Developer Courses for Apache Hadoop • Data Analyst courses ▫ Statistical tools for managing big data ▫ High-Level Languages: Apache Pig, Hive ▫ Programming Languages: Java, C, Python, R etc. • Data Science courses ▫ Mahout: Data mining and machine learning tools over big data ▫ Scientific Data Modeling
  • 30. Starting with HADOOP • Prerequisites for learning the HADOOP platform • Virtual machine environment • Any supported or popular Linux distribution • RHEL, SUSE, CentOS, Ubuntu or Fedora • Oracle Java JDK • HADOOP platform • Single-node first, and then clustered with High-Availability • Cloudera Quickstart VM (CDH 4 or 5) • Cloudera is one of the pioneers in Big Data technologies • CDH, or Cloudera Distribution for HADOOP, is available as a VM • Downloadable from the Cloudera website • Other needed software packages
  • 31. A Brief Introduction to HADOOP • Sometimes expanded as “High Availability Distributed Object Oriented Platform”, although the name is not actually an acronym • Hadoop is the brainchild of Doug Cutting • The name comes from his son’s toy elephant • Based on Google’s published whitepapers • Developed by The Apache Software Foundation (http://apache.org) • Google started in the late 1990s; the 2000s brought data management complexities • In 2004, Google published a whitepaper on MapReduce, a framework that provides a parallel processing model
  • 32. A Brief Introduction to HADOOP • Google’s technologies, namely: 1. GFS (Google File System) – A distributed file system 2. MapReduce – A framework for parallel processing 3. BigTable – A data storage system • These were reimplemented as open-source projects under the Apache Software Foundation, based on Google’s published papers, and are called: 1. HDFS (Hadoop Distributed File System) 2. MapReduce 3. Apache HBase
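The MapReduce processing model named above can be illustrated with a word count. The sketch below is a single-process toy in plain Python, not Hadoop's actual API: it only shows the three conceptual steps (map, shuffle/group by key, reduce). All function names are hypothetical. In a real cluster the map and reduce calls would run in parallel on many nodes, with the framework handling the shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "big value"]))
```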
  • 33. HADOOP - Business problems types
  • 35. Hadoop and MapReduce Architecture
  • 36. Big Data Platform - Hardware and Software needed ! A Sample HADOOP Cluster Configuration
  • 40. NoSQL = Not Only SQL
  • 43. Objectives and Benefits of Big Data
  • 44. Some highlights of the IT industry • The term “IT” for Information Technology first appeared in 1958 • IT has been a catalyst for areas of science and technology, and for businesses • A movement from an IT-driven industry to an open information society • Today we are part of a global village via the internet, which is now a commodity or a common consumer service for many people • Rapidly changing technologies • Fast pace of Innovations and Inventions • One needs to keep pace with new developments • Enterprise examples - Blackberry, WhatsApp
  • 45. What’s driving the Big Data Market - Ad-hoc querying and reporting - Data mining techniques - Structured data, typical sources - Small to mid-size datasets - Optimizations and predictive analytics - Complex statistical analysis - All types of data, and many sources - Very large datasets - More of a real-time
  • 46. Value of Big Data Analytics • Big data is more real-time in nature than traditional DW applications • Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps • Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps
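The shared-nothing idea can be sketched as follows: each record is routed to exactly one node by hashing its key, every node aggregates only its own partition, and the partial results are merged at the end. This is a single-process illustration under assumed names (`partition_for`, `scatter`, `cluster_totals` are hypothetical), with the "nodes" simulated as plain lists; it does not represent any particular product.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int]


def partition_for(key: str, num_nodes: int) -> int:
    """Pick the node that owns a key, using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def scatter(records: Iterable[Record], num_nodes: int) -> List[List[Record]]:
    """Distribute records so that each key lives on exactly one node."""
    nodes: List[List[Record]] = [[] for _ in range(num_nodes)]
    for key, value in records:
        nodes[partition_for(key, num_nodes)].append((key, value))
    return nodes


def local_totals(partition: Iterable[Record]) -> Counter:
    """Work done independently on one node: sum values per key."""
    totals: Counter = Counter()
    for key, value in partition:
        totals[key] += value
    return totals


def cluster_totals(records: Iterable[Record], num_nodes: int) -> Dict[str, int]:
    """Merge the per-node results; nodes never share state while computing."""
    merged: Counter = Counter()
    for partition in scatter(records, num_nodes):
        merged.update(local_totals(partition))
    return dict(merged)


if __name__ == "__main__":
    sales = [("books", 10), ("music", 5), ("books", 7)]
    print(cluster_totals(sales, num_nodes=3))
```

Scaling out in this model means raising `num_nodes`; because no node needs another node's data to do its local work, adding machines adds capacity without adding contention.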
  • 47. Real-world scenarios • Imagine your boss comes to you and says: “Here are 50 GB of logfiles. Find a way to improve our business!” • What would you do? • Where would you start? • And what would you do next? Use Big Data to make better Business Decisions
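One possible first step for the logfile scenario is a streaming pass that aggregates without loading the files into memory. The sketch below assumes a hypothetical whitespace-separated log format in which the third field is an HTTP-style status code; the slide does not specify any format, so both the layout and the function name are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable


def count_status_codes(lines: Iterable[str], status_field: int = 2) -> Counter:
    """Count occurrences of the status field, reading one line at a time.

    The field position is an assumption about the log layout; lines that
    are too short to contain it are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > status_field:
            counts[fields[status_field]] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines in the assumed format: timestamp, path, status.
    sample = [
        "2014-01-01T10:00:00 /home 200",
        "2014-01-01T10:00:01 /cart 500",
        "2014-01-01T10:00:02 /home 200",
    ]
    print(count_status_codes(sample).most_common())
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so memory use stays flat regardless of file size. Once a single machine becomes the bottleneck, the same per-line logic maps naturally onto a distributed job.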
  • 48. Some Examples of Big Data Projects • Consumer product companies are monitoring social media like Facebook and Twitter to get an unprecedented view into customer behavior, preferences, and product perception. • Governments are making data public at both the national and international level for users to develop new applications. • Sports teams are using data for tracking ticket sales and even for tracking team strategies. • Manufacturers are monitoring minute vibration data from their equipment to predict the optimal time for component replacement. • Financial Services organizations are using data mined from customer interactions to create increasingly relevant and sophisticated offers. • Advertising and marketing agencies are tracking social media to understand responsiveness to campaigns, promotions, and other advertising mediums. • Web-based businesses are developing information products that combine data gathered from customers to offer more appealing recommendations and more successful coupon programs.
  • 49. Cloud and Big Data Most of the traditional IT skills are being moved towards the Cloud and Big Data platform. Some related fields: • Artificial Intelligence • Distributed computing / super computing • Business Analytics / Business Intelligence • Data Analytics / Data Mining • Projects running on legacy IT infrastructure
  • 50. The Google Public Data Explorer Makes large datasets easy to explore, visualize and communicate.
  • 51. Google Cloud Platform Makes large datasets easy to explore, visualize and communicate.
  • 53. Google BigQuery Analyze big data in the cloud using Google's BigQuery.