BIG DATA 
8/2014 
jpl@hearandknow.eu
Table of contents 
①Definition : what is big data? 
②Dimensioning uses 
③Why should I be interested? 
•Market study 
④How can it benefit 
•Companies 
•Consumers/citizen 
•Society 
⑤Infrastructure 
•Gathering 
•Storage 
•Networks 
•Processing 
⑥Tools 
⑦Data models and predictive analytics 
⑧From big data to smart data
DEFINITIONS
Cost 
Time 
Users 
Time-keeper genesis
Hype cycle 2014 (copyright Gartner)
Is big data the latest buzzword bingo? 
①http://www.bullshitbingo.net/cards/buzzword/
Why data science?
http://www.personalizemedia.com/garys-social-media-count/
3V by SAP 
[Figure: data sources arranged along three axes : Volume, Velocity, Variety] 
Labels : CRM* data, GPS, demand, speed, transactions, opportunities, service calls, customer, sales orders, inventory, e-mails, tweets, planning, things, mobile, instant messages
What is big data? (original idea Gartner) 
①Volume 
②Variety 
•Structured/unstructured 
•Public/Private 
•Text/image/sound… 
③Velocity 
•Generated 
•Captured 
•Shared 
④Veracity
Four « V » by IBM
Volume 
①From the first wave 10,000 years ago until 1950, mankind created only 5 exabytes 
②1 EB = 1 000 000 000 000 000 000 B = 
③1 000 000 000 gigabytes = 
④1 000 000 terabytes = 
⑤1 000 petabytes... 
⑥Nowadays we produce 5 exabytes every 2 days!
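The unit chain above can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal (SI) units; the helper name `convert` is hypothetical and only illustrates the conversion. It also turns the "5 exabytes every 2 days" figure from the slide into a yearly volume.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}

def convert(value, src, dst):
    """Convert a quantity from unit `src` to unit `dst` (hypothetical helper)."""
    return value * UNITS[src] / UNITS[dst]

# The equalities listed on the slide.
one_eb_in_gb = convert(1, "EB", "GB")
one_eb_in_tb = convert(1, "EB", "TB")
one_eb_in_pb = convert(1, "EB", "PB")

# Slide figure: 5 EB produced every 2 days, extrapolated to one year.
eb_per_day = 5 / 2
zb_per_year = convert(eb_per_day * 365, "EB", "ZB")

print(f"1 EB = {one_eb_in_gb:,.0f} GB = {one_eb_in_tb:,.0f} TB = {one_eb_in_pb:,.0f} PB")
print(f"5 EB every 2 days is roughly {zb_per_year:.2f} ZB per year")
```

Under these assumptions the yearly volume comes out a little below one zettabyte, the same order of magnitude as the 2010 figure quoted on the next slide.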
Volume (cf. Wikipedia) 
①According to an IDC study sponsored by EMC Gartner, the digital data created in the world grew from 
②1.2 zettabytes/year in 2010 to 
③1.8 zettabytes in 2011 and 
④2.8 zettabytes in 2012, and should reach 
⑤40 zettabytes in 2020.
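The growth implied by these figures can be made explicit. The sketch below assumes smooth compound growth between 2010 and 2020, which is a simplification and not a claim of the IDC study; the function name `cagr` is hypothetical.

```python
# Volumes quoted on the slide, in zettabytes per year.
VOLUME_ZB = {2010: 1.2, 2011: 1.8, 2012: 2.8, 2020: 40.0}

def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values (assumes smooth growth)."""
    return (end_value / start_value) ** (1 / years) - 1

growth_2010_2011 = VOLUME_ZB[2011] / VOLUME_ZB[2010] - 1
growth_2011_2012 = VOLUME_ZB[2012] / VOLUME_ZB[2011] - 1
growth_2010_2020 = cagr(VOLUME_ZB[2010], VOLUME_ZB[2020], 2020 - 2010)

print(f"2010 to 2011: {growth_2010_2011:.0%}")
print(f"2011 to 2012: {growth_2011_2012:.0%}")
print(f"2010 to 2020, average per year: {growth_2010_2020:.0%}")
```

With the slide's numbers, the average works out to roughly 40% per year over the decade.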
Volume : storage
Variety : data classifications 
①By structure 
•Structured (SQL-like databases) : about 20% 
•Unstructured : about 80% 
②By source 
•Human originated 
•Non-human originated 
•In-house 
•From outside 
③By movement 
•Data in motion 
•Data at rest
Velocity : example of cellular data rates 
①2G 
•GPRS : 140.6 kbps 
•EDGE : 473.6 kbps 
②3G 
•UMTS : 384 kbps 
•HSPA : 14.4 Mbps 
③4G 
•HSPA+ : 42.2 Mbps 
•LTE : 173 Mbps 
④What’s next?
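To get a feel for what these rates mean, the sketch below computes how long it would take to move one gigabyte at each peak rate listed above. It assumes the slide's figures are theoretical maxima and ignores protocol overhead; the names `PEAK_RATES_BPS` and `transfer_seconds` are hypothetical.

```python
# Peak rates from the slide, in bits per second (theoretical maxima).
PEAK_RATES_BPS = {
    "GPRS": 140.6e3,
    "EDGE": 473.6e3,
    "UMTS": 384e3,
    "HSPA": 14.4e6,
    "HSPA+": 42.2e6,
    "LTE": 173e6,
}

def transfer_seconds(size_bytes, rate_bps):
    """Time needed to send `size_bytes` at `rate_bps`, ignoring all overhead."""
    return size_bytes * 8 / rate_bps

ONE_GB = 10**9  # decimal gigabyte, in bytes
times = {name: transfer_seconds(ONE_GB, rate) for name, rate in PEAK_RATES_BPS.items()}

for name, seconds in times.items():
    print(f"{name:6s} {seconds:10.1f} s")
```

Under these assumptions, one gigabyte takes more than 15 hours over GPRS and under a minute over LTE.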
Big data cycle management 
Capture 
Organize 
Integrate 
Analyze 
Act
Intelligence cycle (source C.I.A.) : similarity? 
①Under the UKUSA agreement, participating governments share 
•facilities, 
•tasks and 
•product. 
②What about analysis? 
2014 Hear & Know 2014 
Intelligence cycle applied to Sigint 
Interception of messages and communications data (metadata) 
Processing 
•Traffic analysis of communications data (who is communicating with whom) 
•Cryptanalysis 
•Analysis of the content of messages 
Analysis with the use of other sources, for example Open Source Intelligence (OSINT) 
Dissemination 
Planning & direction 
Big data « food chain » 
Personal data 
Contacts/Calendar 
Audio/Voice/Music 
Mails/Notes 
Photos/Videos 
Identifiers/Metadata 
Positions 
Navigation history 
Biometric data (fingerprints, voice…) 
Games data 
Storage 
Terminal 
Servers 
Data centers 
Cloud 
Accessibility 
Operators 
radio (short or long range) 
API 
OS 
Development kit 
Applications 
Millions of applications in the Apple App Store and Google Play
Structured data: human generated 
①Input 
②Click stream 
③Gaming related (moves) 
④Quantified self
Structured data : machine/computer generated 
①Sensor 
②Smart meters 
③Weblog 
④Point of sale 
⑤Financial
WHY BIG DATA?
Unsolved problems with « classical » means 
①Search engines with RDBMS -> Google's own solutions
Moore’s law miniaturisation and its limits 
①« The number of transistors integrated on a silicon chip is multiplied by 4 every 3 years » (Moore, 1965) 
②Rock's law : chip manufacturing costs double every 4 years 
③Below 20 nm : quantum effects
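The two laws on this slide can be compared on a common time scale. This is a minimal sketch that takes the slide's figures at face value (×4 every 3 years for transistor count, ×2 every 4 years for manufacturing cost); the function names are hypothetical.

```python
import math

def growth_factor(multiplier, period_years, elapsed_years):
    """Total growth after `elapsed_years` if a quantity is multiplied
    by `multiplier` every `period_years`."""
    return multiplier ** (elapsed_years / period_years)

def doubling_time(multiplier, period_years):
    """Years needed for the quantity to double under the same rule."""
    return period_years * math.log(2) / math.log(multiplier)

# Figures as stated on the slide.
transistor_doubling = doubling_time(4, 3)    # transistor count
cost_doubling = doubling_time(2, 4)          # manufacturing cost

transistors_12y = growth_factor(4, 3, 12)
cost_12y = growth_factor(2, 4, 12)

print(f"Transistor count doubles every {transistor_doubling:.1f} years")
print(f"Manufacturing cost doubles every {cost_doubling:.1f} years")
print(f"After 12 years: transistors x{transistors_12y:.0f}, cost x{cost_12y:.0f}")
```

Multiplying by 4 every 3 years is the same as doubling every 18 months, so under these assumptions the transistor count grows far faster than the manufacturing cost.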
After the end of Moore’s law 
①Around 2020 : limits of classical « etching » physics => evolutions become necessary 
•Pessimistic scenario 
oApplication/architecture innovation 
oCost/price erosion 
•Substitution technologies 
oBiology/organic molecular electronics 
oDNA 
oNeuronal/analog 
oSuperconductors 
oOptics 
oComponents with one or few electrons 
oQuantum computers 
oNanotechnologies 
o…
DIMENSIONING EXAMPLES
The « historical » big data crunchers 
①Simulation (nuclear…) 
②Weather forecasting 
③Sigint 
④The National Security Agency is building the biggest building on earth : the Utah Data Center, planned to hold yottabytes of data collected from the Internet. 
⑤Cryptanalysis
Amazon 
①From on-line bookstore to global IT provider 
②Big data : 
•User initially then 
•Provider
Cellular : potential analysis 
①Cell activity for urban planning and network planning 
②Policy makers 
③Urban planners 
④Traffic engineers 
⑤Weather forecast 
⑥…
Quantified self/Lifelogging 
①Position 
②Sleeping hours 
③Blood pressure/heart rate 
④Pedometer 
⑤Accelerometer/speedometer/distance 
⑥Food/beverage 
⑦Temperature 
⑧Weight 
⑨Height 
⑩Photo 
⑪Voice recording
Trading 
①High frequency/Speed trading (non distributed)
EDF and metering 
①Previous situation 
•35 M homes in France 
•2 meter readings / year 
②Remote metering 
•Every 30 min 
③Result 
•Expected savings : xxx MW
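The change of scale between the two situations can be worked out directly from the figures on the slide. The sketch below assumes one meter per home and a 365-day year; the variable names are hypothetical.

```python
HOMES = 35_000_000          # homes in France, from the slide (one meter each assumed)

# Previous situation: 2 manual readings per meter per year.
manual_per_meter = 2
manual_per_year = HOMES * manual_per_meter

# Remote metering: one reading every 30 minutes.
remote_per_meter = (24 * 60 // 30) * 365    # 48 readings a day
remote_per_year = HOMES * remote_per_meter

ratio = remote_per_year // manual_per_year

print(f"Manual readings per year : {manual_per_year:,}")
print(f"Remote readings per year : {remote_per_year:,}")
print(f"Volume multiplied by     : {ratio:,}")
```

Under these assumptions the yearly number of readings goes from 70 million to more than 600 billion, which is what moves metering into big data territory.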
Cybersecurity Hypervision 
①Digital Attack Map 
②Norse
Big data applications 
①Social media analytics : an impressive example is LinkedIn's « people you may know » 
②Voice analytics : call centers, mobile phones (Siri) 
③Text analytics 
④Video analytics 
⑤Telecom : customer churn 
⑥Behavioural analytics
Marketing 
①Knowledge 
•Brand 
•Competitors 
•Customers 
•Anticipate the market 
•A/B testing
Marketing
Geolocation 
①Skyhook (Google, Apple…) database may be used to observe people's movements
Intelligence family 
Int 
Hum 
OS 
Im 
Sig 
Sigint and family 
Techint 
Sigint 
Comint 
Elint 
Masint 
Imint 
… 
Politics 
①Obama re-election 
②In 2013, Big Data was named one of France's « 7 strategic ambitions » by the Commission Innovation 2030
Science 
①The “Square Kilometre Array” radio telescope will deliver 50 terabytes of analysed data per day, from 7 000 terabytes of raw data per second 
②The Large Hadron Collider has around 150 million sensors delivering data 40 million times per second. 
③About 600 million collisions/s; after filtering, 100 interesting collisions/s remain. 25 petabytes must be stored per year
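The figures on this slide imply very large reduction factors between raw and stored data. The sketch below only rearranges the numbers quoted above, assuming decimal units and a 365-day year; the variable names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

# Square Kilometre Array: 7 000 TB/s of raw data, 50 TB/day of analysed data.
ska_raw_tb_per_day = 7_000 * SECONDS_PER_DAY
ska_reduction = ska_raw_tb_per_day / 50

# Large Hadron Collider: 600 million collisions/s, 100 kept after filtering.
lhc_keep_ratio = 100 / 600_000_000

# 25 PB stored per year, expressed as an average rate in MB/s.
lhc_avg_mb_per_s = 25 * 10**15 / SECONDS_PER_YEAR / 10**6

print(f"SKA raw data per day      : {ska_raw_tb_per_day:,} TB")
print(f"SKA reduction factor      : {ska_reduction:,.0f}")
print(f"LHC collisions kept       : 1 in {1 / lhc_keep_ratio:,.0f}")
print(f"LHC average storage rate  : {lhc_avg_mb_per_s:,.0f} MB/s")
```

In both cases most of the work is done by filtering close to the instrument: only about one part in several million of what is produced is kept.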
IoT/M2M/IoO 
①According to Yole, the Internet of Things will represent 15% of processed data in 2024 
②Electronic components will jump from 9.5 G$ in 2014 to 46 G$ in 2024
IoO 
In 2020, there will be 80 billion connected things, according to Samuel Ropert from Idate. 
IoO alone will account for 85% of IoT, 
terminals for 11% and M2M for 4%. 
Expected annual growth between 2010 and 2020 : 
IoO 41%, 
terminals 22%, 
M2M 16%.
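The shares and growth rates above can be turned into absolute numbers. This is a rough sketch, assuming the percentages apply to the 80 billion total and that growth was constant over the ten years. The 2010 values it derives are an illustration, not figures from Idate.

```python
TOTAL_2020 = 80_000_000_000   # total for 2020, from the slide

SHARES_PERCENT = {"IoO": 85, "terminals": 11, "M2M": 4}
ANNUAL_GROWTH = {"IoO": 0.41, "terminals": 0.22, "M2M": 0.16}
YEARS = 2020 - 2010

# Absolute counts in 2020 for each family.
counts_2020 = {name: TOTAL_2020 * pct // 100 for name, pct in SHARES_PERCENT.items()}

# Installed base in 2010 implied by constant annual growth (an assumption).
implied_2010 = {
    name: counts_2020[name] / (1 + ANNUAL_GROWTH[name]) ** YEARS
    for name in counts_2020
}

for name in counts_2020:
    print(f"{name:10s} 2020: {counts_2020[name] / 1e9:5.1f} bn   "
          f"implied 2010: {implied_2010[name] / 1e9:5.2f} bn")
```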
Roadmap of the Internet of things
MARKET STUDY
Big data growth 
①Annual growth of the Big Data market for 2011-2016 is expected to be 31.7%. 
②The market should reach 23.8 G$ in 2016 (source : IDC, March 2013). 
③Big Data should represent 8% of European GDP in 2020 (AFDEL, February 2013).
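The first two figures can be cross-checked against each other. The sketch below assumes the 31.7% rate compounds over the five years from 2011 to 2016. The 2011 market size it derives is an implied value, not a number quoted by IDC.

```python
ANNUAL_GROWTH = 0.317      # expected annual growth, 2011-2016 (from the slide)
MARKET_2016 = 23.8         # expected market size in 2016, in G$ (from the slide)

# Work backwards from 2016, one year of compound growth at a time.
market_by_year = {2016: MARKET_2016}
for year in range(2015, 2010, -1):
    market_by_year[year] = market_by_year[year + 1] / (1 + ANNUAL_GROWTH)

implied_2011 = market_by_year[2011]

for year in sorted(market_by_year):
    print(f"{year}: {market_by_year[year]:5.1f} G$")
```

Under this assumption the market roughly quadruples in five years, starting from about 6 G$ in 2011.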
Risks and opportunities 
①How can value be created from this flood of data? 
②If you don't do it yourself in your market, the advantage goes to the first mover. 
③Risks for democracy : end of privacy ? Dictatorship based on data ?
INFRASTRUCTURE
Software approach 
Traditional 
•Monolithic 
•Centralised storage 
•RDBMS 
•Data format defined in advance 
•Proprietary 
Big data 
•Distributed 
•Storage and execution at node level 
•Raw data processing 
•Open source
Hardware approach 
Traditional hardware 
•Specific hardware 
•Big central server 
•NAS 
•RAID 
•Expensive 
•Hard to upgrade 
Big data 
•Basic hardware 
•Pizza boxes 
•Ethernet 
•JBOD 
•Inexpensive 
•Easy to upgrade
Big data stack (copyright big data for dummies)
Infrastructure criteria 
①Performance 
②Availability 
③Scalability 
④Flexibility 
⑤Cost 
⑥Redundancy + resilience
Storage characteristics 
Characteristic      RDBMS         Big Data 
Data size           Gigabytes     Petabytes 
Access              Interactive   Near real time or batch 
Schema/structure    Static        Dynamic 
Language            SQL           UQL/procedural (Java, C++…) 
Job scheduling      Hard          Simple 
Integrity           High          High 
Scaling             Non-linear    Linear
TOOLS
Tools 
①Hadoop 
•MapReduce 
②PostgreSQL (www.postgresql.org) 
③R 
④Matlab 
⑤Analyst's Notebook 
⑥Watson
Hadoop 
①Open source 
②Fast (parallel processing) 
③Main components 
•Distributed file system 
•MapReduce engine
MapReduce (made at Google) 
①Map 
②Reduce
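The two phases can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only mimics the programming model. It is not Hadoop's API, and all function names are hypothetical. A real framework would run the map and reduce calls in parallel on many nodes and perform the grouping step itself.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word of one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key (done by the framework between the phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all the values collected for one key."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data big tools", "smart data"])
print(counts)
```

Because each map call sees only one document and each reduce call only one key, both phases can be spread over as many machines as needed, which is the point of the model.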
Why R? 
①Open 
②www.r-project.org 
③www.rstudio.com
Big data, and after? 
①Open data 
②Smart data 
③Linked data
BACK-UP
The cloud 
①Shared resources 
②Applications 
③Computing 
④Storage 
⑤Networking 
⑥Development and deployment platforms
Cloud vocabulary for delivery models 
①IaaS : infrastructure as a service 
②PaaS : platform as a service 
③SaaS : software as a service 
④And especially useful for big data : 
⑤DaaS : data as a service
Cloud players 
①Worldwide 
•Google 
•Apple 
•Microsoft 
•Amazon 
•OpenStack 
•Dropbox… 
②France 
•Cloudwatt 
•Numergy
Improvement 
①Data modelling 
②Data management
At stake 
31/08/2014 
①To raise awareness of the future impact of IoT, let’s answer these three key questions: 
•What kind of data are these devices collecting? 
•What are the different types of “Things” or categories that are getting connected? 
•What are the different use cases that are driving the revenue predictions?

Introduction to big data
