Tutorial 9
a)
NewSQL is a class of relational database management systems that seek to provide the scalability of NoSQL systems for online transaction processing (OLTP) workloads while maintaining the ACID guarantees of a traditional database system. … NewSQL systems attempt to reconcile the conflicts.
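The ACID guarantees mentioned above can be illustrated with a small sketch using Python's built-in sqlite3 module (SQLite is a traditional single-node database, not a NewSQL system, and the accounts table and values here are invented for illustration): a transfer between two accounts either commits as a whole or rolls back entirely.

```python
import sqlite3

# A tiny illustrative schema (hypothetical data, not from any real system).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst inside one transaction."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # triggers rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 30)   # commits: both updates apply together
transfer(conn, "alice", "bob", 500)  # rolls back: neither update survives
```

The point of the sketch is atomicity: the partial debit in the failed transfer never becomes visible, which is exactly the guarantee NewSQL systems try to preserve at NoSQL-like scale.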
b)
There are three defining properties that can help break down the term. Dubbed the three Vs (volume, velocity, and variety), these are key to understanding how we can measure big data and just how different 'big data' is from old-fashioned data.
Volume
The most obvious one is where we'll start. Big data is about volume: volumes of data that can reach unprecedented heights, in fact. It's estimated that 2.5 quintillion bytes of data are created each day, and as a result, there will be 40 zettabytes of data created by 2020, an increase of 300 times from 2005. As a result, it is now not uncommon for large companies to have terabytes, and even petabytes, of data in storage devices and on servers. This data helps to shape the future of a company and its actions, all while tracking progress.
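The figures quoted above can be sanity-checked with a few lines of arithmetic (using decimal SI prefixes; the 2.5-quintillion and 40-zettabyte figures are the estimates from the text, not independent data):

```python
# SI (decimal) prefixes, in bytes.
QUINTILLION = 10**18
ZETTABYTE = 10**21

daily_bytes = 2.5 * QUINTILLION              # 2.5 quintillion bytes per day
yearly_zettabytes = daily_bytes * 365 / ZETTABYTE   # roughly 0.91 ZB per year

# The claimed 300x growth implies roughly this much data existed in 2005:
bytes_2005 = 40 * ZETTABYTE / 300            # roughly 0.13 ZB

print(f"~{yearly_zettabytes:.2f} ZB created per year; ~{bytes_2005 / ZETTABYTE:.2f} ZB total in 2005")
```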
Velocity
The growth of data, and its resulting importance, has changed the way we see data. There once was a time when we didn't see the importance of data in the corporate world, but with the change in how we gather it, we've come to rely on it day to day. Velocity essentially measures how fast the data is coming in. Some data will come in in real time, whereas other data will come in fits and starts, sent to us in batches. And as not all platforms will experience the incoming data at the same pace, it's important not to generalise, discount, or jump to conclusions without having all the facts and figures.
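The contrast described above between real-time arrival and data sent "in fits and starts" can be sketched as two ingestion styles (the event source and batch size are invented for illustration):

```python
from datetime import datetime, timezone

events = [{"id": i, "value": i * 10} for i in range(7)]  # stand-in event source

def ingest_stream(events):
    """Real-time style: handle each event the moment it arrives."""
    for event in events:
        yield {**event, "ingested_at": datetime.now(timezone.utc).isoformat()}

def ingest_batches(events, batch_size=3):
    """Batch style: events accumulate and arrive in chunks."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]

streamed = list(ingest_stream(events))   # 7 individually time-stamped events
batches = list(ingest_batches(events))   # 3 batches of sizes 3, 3 and 1
```

A real pipeline would pull from a queue or socket rather than a list, but the shape of the two paths, per-event versus per-chunk, is the velocity distinction the text is making.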
Variety
Data was once collected from one place and delivered in one format. Once taking the shape of database files, such as Excel, CSV and Access, it is now being presented in non-traditional forms, like video, text, PDF, and graphics on social media, as well as via tech such as wearable devices. Although this data is extremely useful to us, it does create more work and requires more analytical skills to decipher this incoming data, make it manageable and allow it to work.
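The shift from one delivery format to many is easy to see in code: structured CSV parses directly into rows, while semi-structured and free-text sources each need their own handling (the sample records below are invented for illustration):

```python
import csv
import io
import json

# Structured: CSV rows map cleanly onto named columns.
csv_source = "name,age\nalice,34\nbob,29\n"
rows = list(csv.DictReader(io.StringIO(csv_source)))

# Semi-structured: JSON from, say, a social feed; fields may vary per record.
json_source = '{"user": "alice", "text": "hello", "video_url": null}'
post = json.loads(json_source)

# Unstructured: plain text needs analysis (here, a trivial keyword count).
text_source = "Big data creates more work but big value."
big_mentions = text_source.lower().count("big")

print(rows[0]["name"], post["user"], big_mentions)
```

Each extra format adds a parsing and modelling step before the data is usable, which is the "more work and more analytical skills" point made above.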
1) Set a big data strategy
At a high level, a big data strategy is a plan designed to help you oversee and improve the way you acquire, store, manage, share and use data within and outside of your organization. A big data strategy sets the stage for business success amid an abundance of data. When developing a strategy, it's important to consider existing and future business and technology goals and initiatives. This calls for treating big data like any other valuable business asset rather than just a byproduct of applications.
2) Know the sources of big data
Streaming data comes from the Internet of Things (IoT) and other connected devices; it flows into IT systems from wearables, smart cars, medical devices, industrial equipment and more. You can analyze this big data as it arrives, deciding which data to keep or discard, and which needs further analysis.
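Deciding as data arrives which records to keep, discard, or flag for further analysis can be sketched as a filter over the stream (the sensor fields and thresholds here are invented for illustration):

```python
def triage(readings, keep_threshold=10, review_threshold=100):
    """Route each reading as it arrives: discard noise, keep normal values,
    and flag outliers for further analysis."""
    kept, review = [], []
    for r in readings:
        if r["value"] < keep_threshold:
            continue                      # discard low-signal readings
        elif r["value"] >= review_threshold:
            review.append(r)              # flag for further analysis
        else:
            kept.append(r)                # keep as-is
    return kept, review

stream = [{"device": "wearable-1", "value": v} for v in (3, 42, 250, 18, 7)]
kept, review = triage(stream)
```

In production this logic would sit inside a stream processor, but the keep/discard/escalate decision per arriving record is the same.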
Social media data stems from interactions on Facebook, YouTube, Instagram, etc. This includes vast amounts of big data in the form of images, videos, voice, text and sound, useful for marketing, sales and support functions. This data is often in unstructured or semistructured forms, so it poses a unique challenge for consumption and analysis.
Publicly available data comes from massive amounts of open data sources like the US government's data.gov, the CIA World Factbook or the European Union Open Data Portal.
Other big data may come from data lakes, cloud data sources, suppliers and customers.
3) Access, manage and store big data
Modern computing systems provide the speed, power and flexibility needed to quickly access massive amounts and types of big data. Along with reliable access, companies also need methods for integrating the data, ensuring data quality, providing data governance and storage, and preparing the data for analytics. Some data may be stored on-premises in a traditional data warehouse, but there are also flexible, low-cost options for storing and handling big data via cloud solutions, data lakes and Hadoop.
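One preparation step mentioned above, ensuring data quality before storage and analytics, can be sketched as a simple validation pass; the record schema and rules here are invented for illustration:

```python
def validate(record):
    """Return a list of quality problems found in one record."""
    problems = []
    if not record.get("id"):
        problems.append("missing id")
    age = record.get("age")
    if age is None or not (0 <= age <= 130):
        problems.append("age out of range")
    return problems

records = [
    {"id": "r1", "age": 41},
    {"id": "",   "age": 41},    # missing id
    {"id": "r3", "age": 200},   # implausible age
]
clean = [r for r in records if not validate(r)]
rejected = [(r["id"], validate(r)) for r in records if validate(r)]
```

Real pipelines use dedicated data-quality tooling, but the principle is the same: records that fail the rules are quarantined with a reason rather than loaded silently.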
4) Analyze big data
With high-performance technologies like grid computing or in-memory analytics, organizations can choose to use all their big data for analyses. Another approach is to determine upfront which data is relevant before analyzing it. Either way, big data analytics is how companies gain value and insights from data. Increasingly, big data feeds today's advanced analytics endeavors such as artificial intelligence.
5) Make intelligent, data-driven decisions
Well-managed, trusted data leads to trusted analytics and trusted decisions. To stay competitive, businesses need to seize the full value of big data and operate in a data-driven way, making decisions based on the evidence presented by big data rather than gut instinct. The benefits of being data-driven are clear. Data-driven organizations perform better, are operationally more predictable and are more profitable.
c)
HDFS Assumptions and Goals
I. Hardware failure
Hardware failure is no longer the exception; it has become the norm. An HDFS instance consists of hundreds or thousands of server machines, each of which stores part of the file system's data. With such a huge number of components, each susceptible to hardware failure, some components are always non-functional. So a core architectural goal of HDFS is quick, automatic fault detection and recovery.
II. Streaming data access
HDFS applications need streaming access to their datasets. Hadoop HDFS is mainly designed for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency. It focuses on retrieving data at the fastest possible speed, for example while analyzing logs.
III. Large datasets
HDFS works with large data sets. In standard practice, a file in HDFS ranges in size from gigabytes to petabytes. The architecture of HDFS should be designed in such a way that it is best for storing and retrieving huge amounts of data. HDFS should provide high aggregate data bandwidth and should be able to scale to hundreds of nodes in a single cluster. It should also be able to handle tens of millions of files in a single instance.
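The scale described above is easier to grasp with a little arithmetic. HDFS splits each file into fixed-size blocks (128 MB is the default in recent Hadoop versions) and replicates each block (3 copies by default); the 1 TB file size below is just an example:

```python
BLOCK_SIZE = 128 * 1024**2       # 128 MB, the default HDFS block size
REPLICATION = 3                  # default HDFS replication factor

file_size = 1 * 1024**4          # an example 1 TB file
blocks = -(-file_size // BLOCK_SIZE)   # ceiling division: blocks needed
stored = file_size * REPLICATION       # raw capacity consumed across the cluster

print(f"{blocks} blocks, {stored / 1024**4:.0f} TB of raw storage")
```

A single 1 TB file thus becomes 8192 blocks spread over the cluster, which is why aggregate bandwidth across many nodes matters more than the speed of any one disk.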
IV. Simple coherency model
HDFS follows a write-once-read-many access model for files. Once a file is created, written, and closed, it should not be changed. This resolves data coherency issues and enables high-throughput data access. A MapReduce-based application or a web crawler application fits perfectly in this model. As per Apache's notes, there is a plan to support appending writes to files in the future.
V. Moving computation is cheaper than moving data
If an application does its computation near the data it operates on, it is much more efficient than when the computation is done far away. This becomes even more important with large datasets. The main advantage is that it increases the overall throughput of the system and minimizes network congestion. The assumption is that it is better to move the computation closer to the data than to move the data to the computation.
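The claim above can be made concrete with a back-of-the-envelope comparison: shipping a large dataset across a network is far slower than shipping a small program to where the data lives. The link speed and sizes below are assumptions chosen for illustration:

```python
dataset = 10 * 10**12          # an assumed 10 TB dataset (decimal units)
program = 50 * 10**6           # an assumed 50 MB application bundle
link = 1 * 10**9 / 8           # a 1 Gbit/s link is 125 MB/s

move_data = dataset / link     # seconds to move the data to the computation
move_code = program / link     # seconds to move the computation to the data

print(f"move data: {move_data / 3600:.1f} h, move code: {move_code:.1f} s")
```

Under these assumptions, moving the data takes over 22 hours while moving the code takes well under a second, which is the intuition behind HDFS scheduling tasks on the nodes that already hold the relevant blocks.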
VI. Portability across heterogeneous hardware and software platforms
HDFS is designed to be portable from one platform to another, which enables its widespread adoption. This makes HDFS a good choice when dealing with large sets of data.