Tutorial 9
a)
NewSQL is a class of relational database management systems that seek to provide the scalability of NoSQL systems for online transaction processing (OLTP) workloads while maintaining the ACID guarantees of a traditional database system. NewSQL systems attempt to reconcile these conflicting demands.
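The ACID guarantees that NewSQL systems preserve can be illustrated with a minimal single-node sketch. This uses Python's built-in sqlite3 purely as a stand-in (it is not a NewSQL system): a money transfer either commits both updates atomically or rolls both back, which is the transactional behaviour NewSQL aims to keep at NoSQL-like scale.

```python
import sqlite3

# Single-node illustration of ACID transaction semantics; sqlite3 is used
# only as a convenient stand-in, not as an example of a NewSQL system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

# Atomicity: both updates commit together, or neither does.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 60 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 60 WHERE name = 'bob'")
    # an exception raised here would roll BOTH updates back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 40, 'bob': 60}
```

The `with conn:` block is sqlite3's transaction context manager: a clean exit commits, an exception rolls back, so the total across accounts is never observed in a half-updated state.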
b)
There are three defining properties that can help break down the term. Dubbed the three Vs (volume, velocity, and variety), these are key to understanding how we can measure big data and just how different 'big data' is from old-fashioned data.
Volume
The most obvious one is where we'll start. Big data is about volume: volumes of data that can reach unprecedented heights. It's estimated that 2.5 quintillion bytes of data are created each day, and as a result there will be 40 zettabytes of data created by 2020, an increase of 300 times from 2005. It is now not uncommon for large companies to have terabytes, and even petabytes, of data in storage devices and on servers. This data helps to shape the future of a company and its actions, all while tracking progress.
Velocity
The growth of data, and its resulting importance, has changed the way we see data. There once was a time when we didn't see the importance of data in the corporate world, but with the change in how we gather it, we've come to rely on it day to day. Velocity essentially measures how fast the data is coming in. Some data will arrive in real time, whereas other data will come in fits and starts, sent to us in batches. And as not all platforms will experience the incoming data at the same pace, it's important not to generalise, discount, or jump to conclusions without having all the facts and figures.
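The two arrival patterns described above can be sketched in a few lines. This is a hypothetical illustration (the event records and function names are made up): one path processes events one at a time as they arrive, the other processes the same events delivered in batches, and both end up handling the same data.

```python
from datetime import datetime, timezone

def ingest_stream(events):
    """Process events one by one as they arrive (real-time path)."""
    processed = []
    for event in events:
        # tag each event with its arrival time as it is handled
        processed.append({**event, "seen_at": datetime.now(timezone.utc).isoformat()})
    return processed

def ingest_batches(batches):
    """Process events that come in fits and starts, a batch at a time."""
    processed = []
    for batch in batches:
        processed.extend(ingest_stream(batch))
    return processed

# the same three events, arriving two different ways
stream  = [{"id": 1}, {"id": 2}, {"id": 3}]
batched = [[{"id": 1}, {"id": 2}], [{"id": 3}]]

print(len(ingest_stream(stream)), len(ingest_batches(batched)))  # 3 3
```

The point is that velocity changes *when* work happens, not *what* data you end up with, which is why conclusions should wait until all the facts and figures are in.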
Variety
Data was once collected from one place and delivered in one format. Once taking the shape of database files such as Excel, CSV and Access, it is now being presented in non-traditional forms like video, text, PDF and graphics on social media, as well as via technology such as wearable devices. Although this data is extremely useful to us, it does create more work and requires more analytical skills to decipher the incoming data, make it manageable and put it to work.
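A small sketch of the variety problem, using only the Python standard library and made-up records: the same kind of information arrives once as flat, structured CSV and once as nested, semi-structured JSON, and each form needs its own parsing step before analysis.

```python
import csv
import io
import json

# Structured: flat rows with a fixed header, straight out of a "database file".
csv_source = "user,action\nalice,like\nbob,share\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_source)))

# Semi-structured: nested fields that a fixed-column reader cannot handle.
json_source = '[{"user": "alice", "action": "like", "media": {"type": "video"}}]'
json_rows = json.loads(json_source)

print(csv_rows[0]["action"])           # flat column lookup -> "like"
print(json_rows[0]["media"]["type"])   # nested lookup      -> "video"
```

Each extra format multiplies the parsing and cleaning work before any two sources can be analysed together, which is exactly the "more analytical skills" cost the paragraph describes.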
1) Set a big data strategy
At a high level, a big data strategy is a plan designed to help you oversee and improve the way you acquire, store, manage, share and use data within and outside of your organization. A big data strategy sets the stage for business success amid an abundance of data. When developing a strategy, it's important to consider existing and future business and technology goals and initiatives. This calls for treating big data like any other valuable business asset rather than just a byproduct of applications.
2) Know the sources of big data
Streaming data comes from the Internet of Things (IoT) and other connected devices that flow into IT systems from wearables, smart cars, medical devices, industrial equipment and more. You can analyze this big data as it arrives, deciding which data to keep or discard, and which needs further analysis.
Social media data stems from interactions on Facebook, YouTube, Instagram, etc. This includes vast amounts of big data in the form of images, videos, voice, text and sound, useful for marketing, sales and support functions. This data is often in unstructured or semistructured forms, so it poses a unique challenge for consumption and analysis.
Publicly available data comes from massive amounts of open data sources like the US government's data.gov, the CIA World Factbook or the European Union Open Data Portal.
Other big data may come from data lakes, cloud data sources, suppliers and customers.
3) Access, manage and store big data
Modern computing systems provide the speed, power and flexibility needed to quickly access massive amounts and types of big data. Along with reliable access, companies also need methods for integrating the data, ensuring data quality, providing data governance and storage, and preparing the data for analytics. Some data may be stored on-premises in a traditional data warehouse, but there are also flexible, low-cost options for storing and handling big data via cloud solutions, data lakes and Hadoop.
4) Analyze big data
With high-performance technologies like grid computing or in-memory analytics, organizations can choose to use all their big data for analyses. Another approach is to determine upfront which data is relevant before analyzing it. Either way, big data analytics is how companies gain value and insights from data. Increasingly, big data feeds today's advanced analytics endeavors such as artificial intelligence.
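The two approaches named above can be contrasted with a toy in-memory example. The records, regions and figures are invented for illustration: one path aggregates over every record, the other decides upfront that only one region is relevant and filters before analysing.

```python
# Made-up sales records standing in for a much larger dataset.
records = [
    {"region": "eu", "sales": 120},
    {"region": "us", "sales": 80},
    {"region": "eu", "sales": 200},
    {"region": "us", "sales": 50},
]

# Approach 1: use all the data for the analysis.
total_all = sum(r["sales"] for r in records)

# Approach 2: determine upfront which data is relevant (here: 'eu' only),
# then analyze only that subset.
relevant = [r for r in records if r["region"] == "eu"]
total_eu = sum(r["sales"] for r in relevant)

print(total_all, total_eu)  # 450 320
```

At real scale the trade-off is cost versus completeness: filtering first touches far less data, while analysing everything avoids discarding records that might matter later.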
5) Make intelligent, data-driven decisions
Well-managed, trusted data leads to trusted analytics and trusted decisions. To stay competitive, businesses need to seize the full value of big data and operate in a data-driven way, making decisions based on the evidence presented by big data rather than gut instinct. The benefits of being data-driven are clear. Data-driven organizations perform better, are operationally more predictable and are more profitable.
c)
HDFS Assumptions and Goals
I. Hardware failure
Hardware failure is the norm rather than the exception. An HDFS instance consists of hundreds or thousands of server machines, each of which stores part of the file system's data. With such a huge number of components, each susceptible to hardware failure, some components are always non-functional. So a core architectural goal of HDFS is quick, automatic fault detection and recovery.
II. Streaming data access
HDFS applications need streaming access to their datasets. Hadoop HDFS is mainly designed for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency; the aim is to retrieve data at the fastest possible rate, for example while analyzing logs.
III. Large datasets
HDFS works with large data sets. In standard practice, a file in HDFS ranges in size from gigabytes to petabytes. The architecture of HDFS should be designed so that it is best suited for storing and retrieving huge amounts of data. HDFS should provide high aggregate data bandwidth, should scale to hundreds of nodes in a single cluster, and should handle tens of millions of files in a single instance.
IV. Simple coherency model
HDFS is built around a write-once-read-many access model for files. Once a file is created, written, and closed, it should not be changed. This resolves data coherency issues and enables high-throughput data access. A MapReduce-based application or a web crawler application fits this model perfectly. As per the Apache notes, there is a plan to support appending writes to files in the future.
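The write-once-read-many discipline is simple enough to sketch with an ordinary local file (this is an analogy, not the HDFS API): the file is written and closed exactly once, and from then on it is only read, so every reader sees identical content with no coherency protocol needed.

```python
import tempfile
from pathlib import Path

# A local file standing in for an HDFS file under the
# write-once-read-many model.
path = Path(tempfile.mkdtemp()) / "block.txt"

# Write once: the file is created, written, and closed in one step.
path.write_text("record-1\nrecord-2\n")

# Read many: no writer ever reopens the file, so every read returns
# exactly the same bytes and readers never need to coordinate.
reads = [path.read_text() for _ in range(3)]
print(all(r == "record-1\nrecord-2\n" for r in reads))  # True
```

Because no update can race with a read, the file system can hand out the data at full throughput without locking or cache-invalidation machinery, which is the coherency simplification the model buys.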
V. Moving computation is cheaper than moving data
If an application performs its computation near the data it operates on, it is much more efficient than when the computation is done far away. This effect grows stronger with large datasets. The main advantage is that it increases the overall throughput of the system and minimizes network congestion. The assumption is that it is better to move computation closer to the data than to move data to the computation.
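A minimal sketch of this idea, in the MapReduce style (plain Python lists stand in for data blocks on different nodes, so this is not the Hadoop API): a small counting function is "shipped" to each partition, and only the tiny per-partition summaries travel back to be merged, instead of the raw data moving anywhere.

```python
# Each list stands in for a data partition stored on a different node.
partitions = [
    ["big data", "big cluster"],   # data local to node 1
    ["data node", "data rack"],    # data local to node 2
]

def map_partition(lines):
    """Runs where the data lives: count words in the local partition only."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# Only the small per-partition dictionaries cross the "network",
# never the raw lines themselves.
partial = [map_partition(p) for p in partitions]

def reduce_counts(partials):
    """Merge the small summaries shipped back from each node."""
    total = {}
    for counts in partials:
        for word, n in counts.items():
            total[word] = total.get(word, 0) + n
    return total

print(reduce_counts(partial))
# {'big': 2, 'data': 3, 'cluster': 1, 'node': 1, 'rack': 1}
```

The bytes that move are the summaries, which stay small no matter how large the partitions grow; that asymmetry is why moving the computation wins as datasets scale.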
VI. Portability across heterogeneous hardware and software platforms
HDFS is designed to be portable from one platform to another, which enables its widespread adoption. This makes it well suited as a platform for dealing with large sets of data.