Thilga

Introduction to Big Data
Name: R.Thilakavathi
Class: II M.Sc Computer Science
Batch: 2017-2019
Staff in charge: Ms. M. Florence Dayana

What’s Big Data?
No single definition; here is one from Wikipedia:
• Big data is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools or traditional data processing
applications.
• The challenges include capture, curation, storage, search,
sharing, transfer, analysis, and visualization.
• The trend to larger data sets is due to the additional
information derivable from analysis of a single large set of
related data, as compared to separate smaller sets with the
same total amount of data, allowing correlations to be found to
"spot business trends, determine quality of research, prevent
diseases, link legal citations, combat crime, and determine
real-time roadway traffic conditions."

• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 2+ billion people on the Web by end of 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009… 200M by 2014

The Earthscope
• The Earthscope is the world's largest
science project. Designed to track
North America's geological evolution,
this observatory records data over 3.8
million square miles, amassing 67
terabytes of data. It analyzes seismic
slips in the San Andreas fault, sure, but
also the plume of magma underneath
Yellowstone and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)

Variety (Complexity)
• Relational Data (Tables/Transaction/Legacy
Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
– Social Network, Semantic Web (RDF), …
• Streaming Data
– You can only scan the data once
• A single application can be generating/collecting
many types of data
• Big Public Data (online, weather, finance, etc.)
To extract knowledge, all these types of
data need to be linked together
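
The "scan the data once" constraint on streaming data can be illustrated with a one-pass computation. The sketch below is only an illustration, not part of the original slides: it uses Welford's method to keep a running mean and variance without storing the stream. The function name `running_stats` and the sample readings are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def running_stats(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) after a single pass."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


def sensor_readings() -> Iterator[float]:
    """Hypothetical stream: a generator can be consumed only once."""
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield value


count, mean, variance = running_stats(sensor_readings())
print(count, round(mean, 6), round(variance, 6))
```

Memory use stays constant no matter how long the stream is, which is the property that matters when the data cannot be revisited.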

A Single View to the Customer
(Figure: the Customer at the center, linked to Social Media, Gaming, Entertain, Banking, Finance, Our Known History, and Purchase.)

Velocity (Speed)
• Data is being generated fast and needs to be
processed fast
• Online Data Analytics
• Late decisions → missing opportunities
• Examples
– E-Promotions: Based on your current location, your purchase history, what
you like → send promotions right now for the store next to you
– Healthcare monitoring: sensors monitor your activities and body; any
abnormal measurements require immediate reaction
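
The healthcare-monitoring example can be sketched as a sliding-window check that reacts to each reading as it arrives. This is a minimal sketch under stated assumptions: the window size, the three-sigma threshold, the function name `detect_anomalies`, and the heart-rate values are all hypothetical and not taken from the slides.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque, Iterable, List, Tuple


def detect_anomalies(
    readings: Iterable[float], window: int = 5, threshold: float = 3.0
) -> List[Tuple[int, float]]:
    """Flag readings that deviate strongly from the recent baseline."""
    recent: Deque[float] = deque(maxlen=window)
    alerts: List[Tuple[int, float]] = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                alerts.append((i, value))
                continue  # keep the outlier out of the baseline window
        recent.append(value)
    return alerts


# Hypothetical heart-rate samples with one abnormal spike.
heart_rate = [72, 74, 73, 75, 74, 73, 140, 74, 72, 75]
alerts = detect_anomalies(heart_rate)
print(alerts)
```

Each reading is judged against only the last few values, so the decision is available immediately instead of after a batch job.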

Real-time/Fast Data
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Mobile devices (tracking all objects all the time)
Sensor technology and networks (measuring all kinds of data)
• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge
from the collected data in a timely manner and in a scalable fashion

Real-Time Analytics/Decision Requirement
Customer: Influence Behavior
• Product Recommendations that are Relevant & Compelling
• Friend Invitations to join a Game or Activity that expands business
• Preventing Fraud as it is Occurring & preventing more proactively
• Learning why Customers Switch to competitors and their offers; in time to Counter
• Improving the Marketing Effectiveness of a Promotion while it is still in Play

Harnessing Big Data
• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
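
The difference between the first two workloads can be shown on a small scale. The sketch below is only an illustration using Python's built-in `sqlite3` module: the `orders` table and its rows are hypothetical. The inserts stand in for OLTP (many small writes inside a transaction) and the grouped query stands in for OLAP (one read that scans and aggregates many rows).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL)"
)

# OLTP-style work: short transactions that each touch a few rows.
new_orders = [
    ("alice", "north", 20.0),
    ("bob", "south", 35.5),
    ("carol", "north", 15.0),
    ("dave", "south", 4.5),
]
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO orders (customer, region, amount) VALUES (?, ?, ?)",
        new_orders,
    )

# OLAP-style work: one query that aggregates over the whole table.
summary = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(summary)
conn.close()
```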

The Model Has Changed…
• The Model of Generating/Consuming Data has
Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data

What’s driving Big Data
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time

The Evolution of Business Intelligence
Timeline: 1990’s, 2000’s, 2010’s (axes: Speed and Scale)
• BI Reporting, OLAP & Data warehouse (Business Objects, SAS, Informatica, Cognos, other SQL Reporting Tools)
• Interactive Business Intelligence & In-memory RDBMS (QlikView, Tableau, HANA)
• Big Data: Batch Processing & Distributed Data Store (Hadoop/Spark; HBase/Cassandra)
• Big Data: Real Time & Single View (Graph Databases)

Big Data Analytics
• Big data is more real-time in nature
than traditional DW applications
• Traditional DW architectures (e.g.
Exadata, Teradata) are not
well-suited for big data apps
• Shared nothing, massively parallel
processing, scale out architectures
are well-suited for big data apps
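
Why shared-nothing, scale-out designs fit this kind of work can be seen in a small simulation. The sketch below is an assumption-laden illustration, not a real cluster: the "nodes" are plain Python lists processed one after another, and the names `partition`, `local_aggregate`, and `merge` are hypothetical. Each node computes a partial (sum, count) over only its own data, and only those small partials are combined.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float]  # (key, value); hypothetical schema


def partition(records: Iterable[Record], num_nodes: int) -> List[List[Record]]:
    """Spread records across nodes round-robin; no node sees the others' data."""
    nodes: List[List[Record]] = [[] for _ in range(num_nodes)]
    for i, record in enumerate(records):
        nodes[i % num_nodes].append(record)
    return nodes


def local_aggregate(records: Iterable[Record]) -> Dict[str, Tuple[float, int]]:
    """Per-node work: partial (sum, count) for each key."""
    partial: Dict[str, Tuple[float, int]] = {}
    for key, value in records:
        total, count = partial.get(key, (0.0, 0))
        partial[key] = (total + value, count + 1)
    return partial


def merge(partials: Iterable[Dict[str, Tuple[float, int]]]) -> Dict[str, float]:
    """Coordinator work: combine the partials and compute the averages."""
    combined: Dict[str, Tuple[float, int]] = {}
    for partial in partials:
        for key, (total, count) in partial.items():
            t, c = combined.get(key, (0.0, 0))
            combined[key] = (t + total, c + count)
    return {key: total / count for key, (total, count) in combined.items()}


records = [
    ("north", 10.0), ("south", 20.0), ("north", 30.0),
    ("south", 40.0), ("north", 50.0),
]
averages = merge(local_aggregate(node) for node in partition(records, num_nodes=3))
print(averages)
```

Adding nodes only changes `num_nodes`; the per-node function is unchanged, which is the sense in which the design scales out.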

Benefits
• Cost & management
– Economies of scale, “out-sourced” resource
management
• Reduced Time to deployment
– Ease of assembly, works “out of the box”
• Scaling
– On demand provisioning, co-locate data and compute
• Reliability
– Massive, redundant, shared resources
• Sustainability
– Hardware not owned

Infrastructure as a Service (IaaS)

More Refined Categorization
• Storage-as-a-service
• Database-as-a-service
• Information-as-a-service
• Process-as-a-service
• Application-as-a-service
• Platform-as-a-service
• Integration-as-a-service
• Security-as-a-service
• Management/Governance-as-a-service
• Testing-as-a-service
• Infrastructure-as-a-service
(Source: InfoWorld Cloud Computing Deep Dive)

Enabling Technology: Virtualization
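(Figure: two stacks. Without virtualization: Hardware, Operating System, then App, App, App. With virtualization: Hardware, Hypervisor, several OS instances, then App, App, App.)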

What does the Azure platform offer to
developers?

Google AppEngine vs. Amazon EC2/S3 (June 3, 2008)
AppEngine:
• Higher-level functionality
(e.g., automatic scaling)
• More restrictive
(e.g., respond to URL only)
• Proprietary lock-in
• Building blocks: Python, BigTable, Other APIs
EC2/S3:
• Lower-level functionality
• More flexible
• Coarser billing model
• Building blocks: VMs, Flat File Storage

Cloud Resources
• Hadoop on your local machine
• Hadoop in a virtual machine on your local
machine (Pseudo-Distributed on Ubuntu)
• Hadoop in the clouds with Amazon EC2
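
Whichever of these setups is used, the programming model Hadoop runs is MapReduce. The sketch below emulates its three phases (map, shuffle/sort, reduce) in a single local Python process on a word-count job. It is only an illustration: it does not call any Hadoop API, and the names `mapper`, `reducer`, and `run_job` and the input lines are hypothetical.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: sum all counts that were emitted for one word."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Local stand-in for the framework: map, then shuffle/sort, then reduce."""
    mapped: List[Tuple[str, int]] = [
        pair for line in lines for pair in mapper(line)
    ]
    mapped.sort(key=itemgetter(0))  # shuffle/sort: bring equal keys together
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(mapped, key=itemgetter(0))
    )


lines = ["big data is big", "data is everywhere"]
counts = run_job(lines)
print(counts)
```

On a real cluster the mapper and reducer would run on many machines and the framework would handle the shuffle; the two functions themselves would look much the same.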

Course Prerequisites
• Prerequisites:
– Java Programming / C++
– Data Structures and Algorithms
– Computer Architecture
– Basic Statistics and Probability
– Database and Data Mining (preferred)
