Big Data Rampage !
NIKO VUOKKO
13 MAY 2013, HIIT SEMINAR
The data
About that data of yours…
• Researchers generally live in a nice utopia where data just works *
*Yes, you do munge it for days, that’s nice
Reality check
What if you suddenly notice that there’s
• … corrupted JSON/XML/whatever
• … corrupted ids
• … transient ids
• … 5 different transient ids
• … text in number fields
• … new fields
• … disappeared fields
• … fields whose meaning just changed
• … but you have no idea of the new definition
• … all of these, regularly, without advance notice
• … and the bad data is coming at you at 1 GB per hour
• … and your own or someone else’s business depends on the data
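One way to cope with records like these is to validate each one defensively and quarantine whatever fails, rather than letting it crash the pipeline. A minimal sketch, assuming newline-delimited JSON events; the field names (`user_id`, `amount`) and the helper functions are hypothetical and not from the talk:

```python
import json

REQUIRED_FIELDS = {"user_id", "amount"}  # hypothetical schema, for illustration only


def parse_event(line):
    """Return (record, None) for a usable event, or (None, reason) otherwise."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None, "corrupted JSON"
    if not isinstance(record, dict):
        return None, "not an object"
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return None, "missing fields: " + ", ".join(sorted(missing))
    try:
        record["amount"] = float(record["amount"])
    except (TypeError, ValueError):
        return None, "text in number field"
    # Unknown extra fields are kept as-is so that new fields do not break parsing.
    return record, None


def split_good_and_bad(lines):
    """Separate usable records from rejected lines, keeping the reason for each reject."""
    good, bad = [], []
    for line in lines:
        record, reason = parse_event(line)
        if record is None:
            bad.append((line, reason))
        else:
            good.append(record)
    return good, bad
```

Keeping the rejected lines together with a reason makes it possible to notice when a field silently changes or disappears, instead of discovering it downstream.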
[Diagram: Garbage --> You --> Great insights]
The data
• Enriched by many operationally attainable sources
--> varying schema and complicated ID soup
• Developed by frontline instead of IT waterfall
--> faster process, but volatile data definition
• Data scientists often require access to more data
--> further risks of lapses
• Big and streaming in
--> risks of discontinuity
The Big Data
PLEASE DON’T SHOOT ME FOR USING THE TERM
What is big?
Human-generated
• 5K tweets / s
• 25K events / s from a mobile game (that’s 200 GB / day)
• 40K Google searches / s
Machine-generated
• 5M quotes / s in the US options market
• 120 MB / s of diagnostics from a single gas turbine
• 1 PB / s peaking from CERN LHC
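To put these rates on a common scale, the arithmetic can be sketched as follows, assuming decimal units (1 GB = 10^9 bytes) and a constant rate over a full day; the helper names are illustrative:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def bytes_per_event(events_per_second, gigabytes_per_day):
    """Average event size implied by an event rate and a daily volume."""
    return gigabytes_per_day * 1e9 / (events_per_second * SECONDS_PER_DAY)


def terabytes_per_day(megabytes_per_second):
    """Daily volume implied by a sustained throughput."""
    return megabytes_per_second * 1e6 * SECONDS_PER_DAY / 1e12


# Mobile game: 25K events/s and 200 GB/day imply roughly 93 bytes per event.
game_event_size = bytes_per_event(25_000, 200)

# Gas turbine: 120 MB/s sustained is roughly 10.4 TB/day.
turbine_daily_volume = terabytes_per_day(120)
```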
What will be big?
• Human-generated data will get more detailed
• … but won’t grow much faster than the userbase
• It will become small eventually
• Machine-generated data will grow according to Moore’s law
• … and it’s already massive
How many of you consider this scale?
• Why not?
• We already understand CPU and memory intensive problems
• But the new world out there is data intensive
• How can research stay in touch with change and stay relevant?
The Curriculum
RETROFITTING CS STUDIES
Software Architectures
• Single thread performance and disk IO hitting a wall
• How do learning algorithms scale out of this corner?
• Stochastic methods
• Ensembles
• Online learning
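As an illustration of how stochastic and online methods sidestep the memory and disk wall, here is a minimal sketch of online linear regression trained by stochastic gradient descent: it sees one example at a time and keeps only the weights in memory. The class name `OnlineLinearRegression`, the learning rate, and the toy data are assumptions for illustration, not from the talk:

```python
import random


class OnlineLinearRegression:
    """Linear model updated one example at a time with SGD on squared error."""

    def __init__(self, n_features, learning_rate=0.05):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.bias + sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, y):
        """Take one gradient step on a single (x, y) example."""
        error = self.predict(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


# Toy stream: y = 2*x + 1 with x drawn uniformly from [-1, 1].
rng = random.Random(0)
model = OnlineLinearRegression(n_features=1)
for _ in range(5000):
    x = [rng.uniform(-1.0, 1.0)]
    model.update(x, 2.0 * x[0] + 1.0)
```

Because each example is used once and then dropped, the same loop works whether the stream holds a thousand records or far more than fits on one machine's disk.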
Databases 1
• In memory: MongoDB, Exasol, Redis
• On disk (single/sharded): MySQL, PostgreSQL
• Data warehouse: Teradata, DB2, Oracle
• Distributed: HDFS, Cassandra, Riak
• Cloud: S3, Azure, GCE, OpenStack
Databases 2
• Good old OLTP
• Analytic
• Key-value stores
• Document stores
• HDFS
• What is the best choice for this job?
Data Structures and Algorithms
• Transforming data is expensive --> play safe with data structures
• Normalization dilemma
• Algorithms must tolerate the volatile nature of data
• Data drift, errors, missing values, outliers
• Models need to be explanatory
• Attention to complexity
• The usual obvious (CPU, memory, disk scans & seeks)
• Iterations
• Model size: What is an example of this?
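One common way to make a summary tolerate missing values and outliers is to use the median and the median absolute deviation (MAD) instead of the mean and standard deviation. A minimal sketch; the function name `robust_summary` and the cutoff of 3.5 are illustrative choices, not from the talk:

```python
import math
import statistics


def robust_summary(values, cutoff=3.5):
    """Return (median, MAD, outliers), ignoring None and NaN entries."""
    clean = [v for v in values
             if v is not None and not (isinstance(v, float) and math.isnan(v))]
    if not clean:
        return None, None, []
    median = statistics.median(clean)
    mad = statistics.median([abs(v - median) for v in clean])
    if mad == 0:
        # More than half the values are identical; this sketch flags nothing then.
        return median, mad, []
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data are approximately normal.
    outliers = [v for v in clean if 0.6745 * abs(v - median) / mad > cutoff]
    return median, mad, outliers
```

A single extreme value shifts the mean and inflates the standard deviation, but it barely moves the median or the MAD, which is why this style of summary is less sensitive to bad records.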
Real-time Systems
What is real-time?
Very different requirements:
• Analyst: “What’s the user count today? By source? Now? From France?”
• Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?”
• Google: “Make a bid for these placements. You have 50 ms”
User Interfaces
• Operations or not, visualization is critical for acceptance
• From business concept to implementation
• What information do these users want to see?
• How does this information support decision making?
• How to visualize it with clarity yet powerfully?
Significance Testing
• Data-driven actions must be backed by numbers
• Early analytics glossed over significance
• Executive: “Can I trust these numbers? Is my decision justified?”
• Systems must act conservatively
• Trust is built slowly, but lost quickly
• Data solutions must not screw up
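For example, before acting on a difference in conversion rates between two variants, a two-sided two-proportion z-test gives a rough check of whether the difference could be noise. A minimal sketch using only the standard library; the function name and the counts are made up for illustration, and the normal approximation assumes reasonably large samples:

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one success rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # All successes or all failures in both groups: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / std_err
    # Two-sided tail probability of a standard normal: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 120 of 1000 users converted in A, 150 of 1000 in B.
z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
```

A system that acts conservatively would only switch variants when the p-value falls below a threshold agreed on in advance, and would otherwise keep collecting data.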
Modeling Information Business Systems
• Understanding business and how to improve it with data
• Mapping: business problem --> data solution
• The most important quality of a data scientist
Contrasts
Hand-written Turing Machine vs Excel
• Average business has tons of low-hanging data fruit
• Developing and automating all that takes years (and years)
• No use for “advanced” stuff without visibility into the underlying data
• There is no shortcut
• The organization itself needs to mature
Supervised vs Unsupervised
• Decide the purpose of the analysis now or later?
• Most often the need is already formulated
• Here’s a standard clustering of human behavior
• Power laws will screw things up
Ad-hoc vs Operations
• Operative data algorithms run day and night without supervision
• Can produce massive leverage and ROI to a business
• … but they are crazy hard to develop
• Ad hoc analysis can employ all the cool stuff from last month’s JMLR
• … but it can’t scale
• … and 90 % of effort goes to communication and visualization
Computation Models
State snapshots
• User actions modify the current state in an OLTP
• Single actions go to offline audit log for re-running
• Data algorithms need to export and import data
• Things are run in batches
• What data used to be (and still often is)
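The relationship between the audit log and the current state can be sketched as a replay: applying the logged actions in order reproduces the state snapshot. The event shape (`user`, `action`, `amount`) and the account-balance example are hypothetical:

```python
def apply_action(state, event):
    """Apply one logged user action to a state dict and return the new state."""
    user, action, amount = event["user"], event["action"], event["amount"]
    new_state = dict(state)  # copy so that earlier snapshots stay untouched
    balance = new_state.get(user, 0)
    if action == "deposit":
        new_state[user] = balance + amount
    elif action == "withdraw":
        new_state[user] = balance - amount
    else:
        raise ValueError(f"unknown action: {action!r}")
    return new_state


def replay(audit_log, initial_state=None):
    """Rebuild a state snapshot by re-running every action in the audit log."""
    state = dict(initial_state or {})
    for event in audit_log:
        state = apply_action(state, event)
    return state


audit_log = [
    {"user": "alice", "action": "deposit", "amount": 100},
    {"user": "bob", "action": "deposit", "amount": 50},
    {"user": "alice", "action": "withdraw", "amount": 30},
]
snapshot = replay(audit_log)
```

Replaying from an earlier snapshot plus the actions logged since then gives the same result as replaying the whole log, which is what makes batch re-runs possible.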
[Diagram: events and three successive snapshots]
Data Warehouse
• Additional endpoint specialized for analytics
• Can run surprisingly many algorithms
• … because the speed is so worth the effort
Cloud
• “Scalable SOA for computation, networking and storage”
• Really all about strict APIs
• Service dog wagging the infrastructure tail
• Public cloud very competitive for the small guys
• Hybrid clouds increasingly replace enterprise systems
Event data
• The event stream itself becomes a first-class citizen and is labeled the master copy
• Needs novel storage
• Needs novel processing
• Data scientists beware! Sugar high imminent!
Stream processing
• New data is coming in all the time
• Process it online
• Data becomes somewhat disposable
• “Why bother with month-old data when there’s too much of it anyway?”
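Processing data online while treating raw records as disposable typically means keeping only a small running summary. A minimal sketch using Welford's algorithm for a running mean and variance; the class name `RunningStats` and the sample values are illustrative:

```python
class RunningStats:
    """Running count, mean and variance (Welford's algorithm); stores no past values."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; undefined (None) until two values have been seen."""
        if self.count < 2:
            return None
        return self._m2 / (self.count - 1)


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)  # each value can be discarded right after this call
```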
Iterative processing
• Has always been the problem with large data
• Keeping state in memory necessary, but hard
• Spark doesn’t solve this, but makes it less painful
• Common fix: don’t do iterations
Hadoop the Hairy Framework
• HDFS, ZooKeeper, MapReduce, Hive, Pig, YARN, Flume, Mahout, Bigtop, Oozie, Hue, HCatalog, Avro, Whirr, Sqoop, Impala, DataFu, …
• Premise of insanely large and/or unstructured data
• You probably don’t need it
Will Hadoop replace the Data Warehouse?
Separate concepts: Hadoop the Framework vs. MapReduce
• MapReduce suited for totally different tasks
• Hadoop can host a data warehouse
• … but it won’t be any easier or quicker to develop
The Purpose
What does Big Data mean for a business?
• Answers … a lot more answers
• Better, more reliable decision making
• Treating customers as individuals instead of segments
• How to design processes (both business and social) to employ data?
Data-driven decision making
Thank you!
• Always eager to talk about this stuff, feel free to contact me!
• Now it’s time for lots of questions!
• niko.vuokko@gmail.com
• linkedin.com/in/nikovuokko
• @nikovuokko
35

More Related Content

What's hot

What makes an effective data team?
What makes an effective data team?What makes an effective data team?
What makes an effective data team?
Snowplow Analytics
 
Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon - Fast Data Presentation Sept 02Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Haluan Irsad
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
Dhiana Deva
 
Agile data science
Agile data scienceAgile data science
Agile data science
Joel Horwitz
 
From Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data EngineeringFrom Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data Engineering
Ry Walker
 
Choosing data warehouse considerations
Choosing data warehouse considerationsChoosing data warehouse considerations
Choosing data warehouse considerations
Aseem Bansal
 
Bi 2.0 hadoop everywhere
Bi 2.0   hadoop everywhereBi 2.0   hadoop everywhere
Bi 2.0 hadoop everywhere
Dmitry Tolpeko
 
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
BigDataEverywhere
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku
 
Big Data Analytics for BI, BA and QA
Big Data Analytics for BI, BA and QABig Data Analytics for BI, BA and QA
Big Data Analytics for BI, BA and QA
Dmitry Tolpeko
 
Hadoop Meets Scrum
Hadoop Meets ScrumHadoop Meets Scrum
Hadoop Meets Scrum
Rommel Garcia
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECH
Dataiku
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
Arvind Kalyan
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku
 
Increasing the Efficiency of Workflows: Use Cases in the Life Sciences
Increasing the Efficiency of Workflows: Use Cases in the Life SciencesIncreasing the Efficiency of Workflows: Use Cases in the Life Sciences
Increasing the Efficiency of Workflows: Use Cases in the Life Sciences
Sandra Gesing
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
Dataiku
 
H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in Python
Sri Ambati
 
Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...
Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...
Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...
Matt Stubbs
 
Dataiku productive application to production - pap is may 2015
Dataiku    productive application to production - pap is may 2015 Dataiku    productive application to production - pap is may 2015
Dataiku productive application to production - pap is may 2015
Dataiku
 

What's hot (20)

What makes an effective data team?
What makes an effective data team?What makes an effective data team?
What makes an effective data team?
 
Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon - Fast Data Presentation Sept 02Stephen Dillon - Fast Data Presentation Sept 02
Stephen Dillon - Fast Data Presentation Sept 02
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Agile data science
Agile data scienceAgile data science
Agile data science
 
From Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data EngineeringFrom Volume to Value - A Guide to Data Engineering
From Volume to Value - A Guide to Data Engineering
 
Choosing data warehouse considerations
Choosing data warehouse considerationsChoosing data warehouse considerations
Choosing data warehouse considerations
 
Bi 2.0 hadoop everywhere
Bi 2.0   hadoop everywhereBi 2.0   hadoop everywhere
Bi 2.0 hadoop everywhere
 
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
Big Data Everywhere Chicago: Leading a Healthcare Company to the Big Data Pro...
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
Big Data Analytics for BI, BA and QA
Big Data Analytics for BI, BA and QABig Data Analytics for BI, BA and QA
Big Data Analytics for BI, BA and QA
 
Hadoop Meets Scrum
Hadoop Meets ScrumHadoop Meets Scrum
Hadoop Meets Scrum
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECH
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
 
Increasing the Efficiency of Workflows: Use Cases in the Life Sciences
Increasing the Efficiency of Workflows: Use Cases in the Life SciencesIncreasing the Efficiency of Workflows: Use Cases in the Life Sciences
Increasing the Efficiency of Workflows: Use Cases in the Life Sciences
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
 
H2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in PythonH2O for Medicine and Intro to H2O in Python
H2O for Medicine and Intro to H2O in Python
 
Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...
Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...
Big Data LDN 2018: HOW RANK GAMING PRODUCTIONISED & AUTOMATED THE MANAGEMENT ...
 
Dataiku productive application to production - pap is may 2015
Dataiku    productive application to production - pap is may 2015 Dataiku    productive application to production - pap is may 2015
Dataiku productive application to production - pap is may 2015
 

Viewers also liked

Analytiikka bisneksessä
Analytiikka bisneksessäAnalytiikka bisneksessä
Analytiikka bisneksessä
Niko Vuokko
 
Corporate spiritin aamiaistilaisuus 12.2.13 tutkitusta tiedosta tulosten hyöd...
Corporate spiritin aamiaistilaisuus 12.2.13 tutkitusta tiedosta tulosten hyöd...Corporate spiritin aamiaistilaisuus 12.2.13 tutkitusta tiedosta tulosten hyöd...
Corporate spiritin aamiaistilaisuus 12.2.13 tutkitusta tiedosta tulosten hyöd...
Corporate Spirit Ltd
 
Johtamisen laadun mittaaminen omistajan ja hallituksen näkökulmasta julkinen
Johtamisen laadun mittaaminen omistajan ja hallituksen näkökulmasta julkinenJohtamisen laadun mittaaminen omistajan ja hallituksen näkökulmasta julkinen
Johtamisen laadun mittaaminen omistajan ja hallituksen näkökulmasta julkinenCorporate Spirit Ltd
 
Digitaalisesti sinun - Digivallankumous nopeasti ja joustavasti - Knowit - To...
Digitaalisesti sinun - Digivallankumous nopeasti ja joustavasti - Knowit - To...Digitaalisesti sinun - Digivallankumous nopeasti ja joustavasti - Knowit - To...
Digitaalisesti sinun - Digivallankumous nopeasti ja joustavasti - Knowit - To...
Tony Virtanen
 
Avoin data ja HRI -infotilaisuus 9.9.2015
Avoin data ja HRI -infotilaisuus 9.9.2015Avoin data ja HRI -infotilaisuus 9.9.2015
Avoin data ja HRI -infotilaisuus 9.9.2015
Helsinki Region Infoshare
 
Henry david
Henry davidHenry david
Henry david
the12thseahawk
 
Sensoridatan ja liiketoiminnan tulevaisuus
Sensoridatan ja liiketoiminnan tulevaisuusSensoridatan ja liiketoiminnan tulevaisuus
Sensoridatan ja liiketoiminnan tulevaisuus
Niko Vuokko
 
Principis dels nous medis de comunicació
Principis dels nous medis de comunicacióPrincipis dels nous medis de comunicació
Principis dels nous medis de comunicació
paulafw
 
Sensor Data in Business
Sensor Data in BusinessSensor Data in Business
Sensor Data in Business
Niko Vuokko
 
Metrics @ App Academy
Metrics @ App AcademyMetrics @ App Academy
Metrics @ App Academy
Niko Vuokko
 
Acer v17 session speciale web-groupe7
Acer v17 session speciale web-groupe7Acer v17 session speciale web-groupe7
Acer v17 session speciale web-groupe7
Jeremie SZTEMBERG
 
Analytics in business
Analytics in businessAnalytics in business
Analytics in business
Niko Vuokko
 
Reaaliaikainen Business Intelligence - WTF
Reaaliaikainen Business Intelligence - WTFReaaliaikainen Business Intelligence - WTF
Reaaliaikainen Business Intelligence - WTF
Mikko Muurinen
 
Analytiikka toiminnan kehittämisessä
Analytiikka toiminnan kehittämisessäAnalytiikka toiminnan kehittämisessä
Analytiikka toiminnan kehittämisessä
Jari Jussila
 
Tutustuminen data-analytiikan ja big datan maailmaan
Tutustuminen data-analytiikan ja big datan maailmaanTutustuminen data-analytiikan ja big datan maailmaan
Tutustuminen data-analytiikan ja big datan maailmaan
Jari Jussila
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Niko Vuokko
 
Information Security Management
Information Security ManagementInformation Security Management
Information Security Management
Novi Research Center
 
Drones in real use
Drones in real useDrones in real use
Drones in real use
Niko Vuokko
 
Economia politica
Economia politicaEconomia politica
Economia politica
Richard Bautista Mamani
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
Bernard Marr
 

Viewers also liked (20)

Analytiikka bisneksessä
Analytiikka bisneksessäAnalytiikka bisneksessä
Analytiikka bisneksessä
 
Corporate spiritin aamiaistilaisuus 12.2.13 tutkitusta tiedosta tulosten hyöd...
Corporate spiritin aamiaistilaisuus 12.2.13 tutkitusta tiedosta tulosten hyöd...Corporate spiritin aamiaistilaisuus 12.2.13 tutkitusta tiedosta tulosten hyöd...
Corporate spiritin aamiaistilaisuus 12.2.13 tutkitusta tiedosta tulosten hyöd...
 
Johtamisen laadun mittaaminen omistajan ja hallituksen näkökulmasta julkinen
Johtamisen laadun mittaaminen omistajan ja hallituksen näkökulmasta julkinenJohtamisen laadun mittaaminen omistajan ja hallituksen näkökulmasta julkinen
Johtamisen laadun mittaaminen omistajan ja hallituksen näkökulmasta julkinen
 
Digitaalisesti sinun - Digivallankumous nopeasti ja joustavasti - Knowit - To...
Digitaalisesti sinun - Digivallankumous nopeasti ja joustavasti - Knowit - To...Digitaalisesti sinun - Digivallankumous nopeasti ja joustavasti - Knowit - To...
Digitaalisesti sinun - Digivallankumous nopeasti ja joustavasti - Knowit - To...
 
Avoin data ja HRI -infotilaisuus 9.9.2015
Avoin data ja HRI -infotilaisuus 9.9.2015Avoin data ja HRI -infotilaisuus 9.9.2015
Avoin data ja HRI -infotilaisuus 9.9.2015
 
Henry david
Henry davidHenry david
Henry david
 
Sensoridatan ja liiketoiminnan tulevaisuus
Sensoridatan ja liiketoiminnan tulevaisuusSensoridatan ja liiketoiminnan tulevaisuus
Sensoridatan ja liiketoiminnan tulevaisuus
 
Principis dels nous medis de comunicació
Principis dels nous medis de comunicacióPrincipis dels nous medis de comunicació
Principis dels nous medis de comunicació
 
Sensor Data in Business
Sensor Data in BusinessSensor Data in Business
Sensor Data in Business
 
Metrics @ App Academy
Metrics @ App AcademyMetrics @ App Academy
Metrics @ App Academy
 
Acer v17 session speciale web-groupe7
Acer v17 session speciale web-groupe7Acer v17 session speciale web-groupe7
Acer v17 session speciale web-groupe7
 
Analytics in business
Analytics in businessAnalytics in business
Analytics in business
 
Reaaliaikainen Business Intelligence - WTF
Reaaliaikainen Business Intelligence - WTFReaaliaikainen Business Intelligence - WTF
Reaaliaikainen Business Intelligence - WTF
 
Analytiikka toiminnan kehittämisessä
Analytiikka toiminnan kehittämisessäAnalytiikka toiminnan kehittämisessä
Analytiikka toiminnan kehittämisessä
 
Tutustuminen data-analytiikan ja big datan maailmaan
Tutustuminen data-analytiikan ja big datan maailmaanTutustuminen data-analytiikan ja big datan maailmaan
Tutustuminen data-analytiikan ja big datan maailmaan
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Information Security Management
Information Security ManagementInformation Security Management
Information Security Management
 
Drones in real use
Drones in real useDrones in real use
Drones in real use
 
Economia politica
Economia politicaEconomia politica
Economia politica
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 

Similar to Big Data Rampage

Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 GenoaHadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
larsgeorge
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Roi Blanco
 
Top BI trends and predictions for 2017
Top BI trends and predictions for 2017Top BI trends and predictions for 2017
Top BI trends and predictions for 2017
Panorama Software
 
The Hadoop Ecosystem for Developers
The Hadoop Ecosystem for DevelopersThe Hadoop Ecosystem for Developers
The Hadoop Ecosystem for Developers
Zohar Elkayam
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
Caserta
 
Big Data & the importance of Data Science
Big Data & the importance of Data ScienceBig Data & the importance of Data Science
Big Data & the importance of Data Science
Wim Van Leuven
 
What Does Big Data Mean and Who Will Win
What Does Big Data Mean and Who Will WinWhat Does Big Data Mean and Who Will Win
What Does Big Data Mean and Who Will Win
BigDataCloud
 
Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - Plumbee
GIAF
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
S P Sajjan
 
Big Data Boom
Big Data BoomBig Data Boom
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
Dr Pradhan PL Pradhan
 
Big data pipelines
Big data pipelinesBig data pipelines
Big data pipelines
Vivek Aanand Ganesan
 
bigdata.pptx
bigdata.pptxbigdata.pptx
bigdata.pptx
VIJAYAPRABAP
 
Transform from database professional to a Big Data architect
Transform from database professional to a Big Data architectTransform from database professional to a Big Data architect
Transform from database professional to a Big Data architect
Saurabh K. Gupta
 
The Death of the Star Schema
The Death of the Star SchemaThe Death of the Star Schema
The Death of the Star Schema
DATAVERSITY
 
Artificial Intelligence and the Data Center
Artificial Intelligence and the Data CenterArtificial Intelligence and the Data Center
Artificial Intelligence and the Data Center
sflaig
 
bigdata.pdf
bigdata.pdfbigdata.pdf
bigdata.pdf
AnjaliKumari301316
 
Big Data
Big DataBig Data
Big Data
Mahesh Bmn
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Denodo
 
Correlation does not mean causation
Correlation does not mean causationCorrelation does not mean causation
Correlation does not mean causation
Peter Varhol
 

Similar to Big Data Rampage (20)

Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 GenoaHadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Top BI trends and predictions for 2017
Top BI trends and predictions for 2017Top BI trends and predictions for 2017
Top BI trends and predictions for 2017
 
The Hadoop Ecosystem for Developers
The Hadoop Ecosystem for DevelopersThe Hadoop Ecosystem for Developers
The Hadoop Ecosystem for Developers
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Big Data & the importance of Data Science
Big Data & the importance of Data ScienceBig Data & the importance of Data Science
Big Data & the importance of Data Science
 
What Does Big Data Mean and Who Will Win
What Does Big Data Mean and Who Will WinWhat Does Big Data Mean and Who Will Win
What Does Big Data Mean and Who Will Win
 
Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - Plumbee
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data Analytics
 
Big Data Boom
Big Data BoomBig Data Boom
Big Data Boom
 
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
 
Big data pipelines
Big data pipelinesBig data pipelines
Big data pipelines
 
bigdata.pptx
bigdata.pptxbigdata.pptx
bigdata.pptx
 
Transform from database professional to a Big Data architect
Transform from database professional to a Big Data architectTransform from database professional to a Big Data architect
Transform from database professional to a Big Data architect
 
The Death of the Star Schema
The Death of the Star SchemaThe Death of the Star Schema
The Death of the Star Schema
 
Artificial Intelligence and the Data Center
Artificial Intelligence and the Data CenterArtificial Intelligence and the Data Center
Artificial Intelligence and the Data Center
 
bigdata.pdf
bigdata.pdfbigdata.pdf
bigdata.pdf
 
Big Data
Big DataBig Data
Big Data
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
 
Correlation does not mean causation
Correlation does not mean causationCorrelation does not mean causation
Correlation does not mean causation
 

Recently uploaded

inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham HillinQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
inQuba Webinar Mastering Customer Journey Management with Dr Graham Hill
LizaNolte
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba"NATO Hackathon Winner: AI-Powered Drug Search",  Taras Kloba
"NATO Hackathon Winner: AI-Powered Drug Search", Taras Kloba
Fwdays
 
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...
AlexanderRichford
 
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...QA or the Highway - Component Testing: Bridging the gap between frontend appl...
QA or the Highway - Component Testing: Bridging the gap between frontend appl...
zjhamm304
 
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsGetting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
Getting the Most Out of ScyllaDB Monitoring: ShareChat's Tips
ScyllaDB
 
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration


Big Data Rampage

  • 1. Big Data Rampage ! NIKO VUOKKO 13 MAY 2013, HIIT SEMINAR
  • 3. About that data of yours…
      • Researchers generally live in a nice utopia where data just works *
      • * Yes, you do munge it for days, that’s nice
      • Reality check
  • 4. What if you suddenly notice that there’s
      • … corrupted JSON/XML/whatever
      • … corrupted ids
      • … transient ids
      • … 5 different transient ids
      • … text in number fields
      • … new fields
      • … disappeared fields
      • … fields whose meaning just changed
      • … but you have no idea of the new definition
      • … all of these, regularly, without advance notice
      • … and the bad data is coming at you at 1 GB per hour
      • … and your or someone else’s business depends on the data
      • [Diagram labels: You, Garbage, Great insights]
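A minimal sketch of how the problems listed on slide 4 (corrupted JSON, text in number fields, new and disappeared fields) might be flagged per record. The schema `EXPECTED`, its field names, and the function `check_record` are hypothetical illustrations, not something from the talk.

```python
import json

# Hypothetical schema (an assumption for illustration): field -> accepted types.
EXPECTED = {"user_id": str, "amount": (int, float), "ts": int}


def check_record(line):
    """Parse one JSON line and return (record, problems).

    record is None when the line cannot be parsed as a JSON object.
    """
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        return None, ["corrupted JSON"]
    if not isinstance(rec, dict):
        return None, ["not an object"]

    problems = []
    # Disappeared fields and wrong types (e.g. text in number fields).
    for field, types in EXPECTED.items():
        if field not in rec:
            problems.append(f"missing field: {field}")
        elif not isinstance(rec[field], types):
            problems.append(f"bad type in {field}")
    # Fields that showed up without notice.
    for field in rec:
        if field not in EXPECTED:
            problems.append(f"new field: {field}")
    return rec, problems
```

A check like this cannot catch a field whose meaning changed silently; that still needs monitoring of value distributions or a conversation with whoever produces the data.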
  • 5. The data
      • Enriched by many operationally attainable sources --> varying schema and complicated ID soup
      • Developed by frontline instead of IT waterfall --> faster process, but volatile data definition
      • Data scientists often require access to more data --> further risks of lapses
      • Big and streaming in --> risks of discontinuity
  • 6. The Big Data (PLEASE DON’T SHOOT ME FOR USING THE TERM)
  • 7. What is big?
      • Human-generated
          • 5K tweets / s
          • 25K events / s from a mobile game (that’s 200 GB / day)
          • 40K Google searches / s
      • Machine-generated
          • 5M quotes / s in the US options market
          • 120 MB / s of diagnostics from a single gas turbine
          • 1 PB / s peaking from CERN LHC
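As a rough sanity check on the slide 7 figures, the arithmetic below converts the per-second rates into daily volumes. It assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^6 MB) and a constant rate over 24 hours; neither assumption is stated in the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Mobile game: 25K events / s, quoted as 200 GB / day.
events_per_day = 25_000 * SECONDS_PER_DAY
bytes_per_event = 200 * 10**9 / events_per_day

# Gas turbine: 120 MB / s of diagnostics, expressed in TB per day.
turbine_tb_per_day = 120 * SECONDS_PER_DAY / 10**6

print(f"{events_per_day:,} events/day")          # 2,160,000,000 events/day
print(f"{bytes_per_event:.1f} bytes per event")  # 92.6 bytes per event
print(f"{turbine_tb_per_day:.1f} TB/day")        # 10.4 TB/day
```

Under these assumptions a single turbine produces roughly fifty times the daily volume of the mobile game, which is in line with the human-generated versus machine-generated contrast the slide draws.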
  • 8. What will be big?
      • Human-generated data will get more detailed
      • … but won’t grow much faster than the userbase
      • It will become small eventually
      • Machine-generated data will grow in line with Moore’s law
      • … and it’s already massive
  • 9. How many of you consider this scale?
      • Why not?
      • We already understand CPU and memory intensive problems
      • But the new world out there is data intensive
      • How can research stay in touch with change and stay relevant?
  • 11. Software Architectures
      • Single thread performance and disk IO hitting a wall
      • How do learning algorithms scale out of this corner?
      • Stochastic methods
      • Ensembles
      • Online learning
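To make "stochastic methods" and "online learning" from slide 11 concrete, here is a minimal sketch of a linear model trained by stochastic gradient descent, one example at a time, so the full dataset never has to sit in memory. The class name, learning rate, and synthetic target are assumptions chosen for illustration.

```python
import random


class OnlineLinearRegression:
    """Linear model updated by one SGD step per incoming example."""

    def __init__(self, n_features, learning_rate=0.05):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def update(self, x, y):
        # Gradient of 0.5 * (prediction - y)^2 with respect to w and b.
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def demo(seed=0, n=5000):
    """Stream n synthetic examples of y = 2*x0 - x1 + 0.5 through the model."""
    rng = random.Random(seed)
    model = OnlineLinearRegression(n_features=2)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = 2.0 * x[0] - 1.0 * x[1] + 0.5
        model.update(x, y)  # the example can be discarded after this call
    return model
```

Each update costs time proportional to the number of features, and the model can keep following the data if its definition drifts. The price is sensitivity to the learning rate, which real streams with noisy or badly scaled features make harder to choose than this noise-free toy suggests.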
  • 12. Databases 1
      • In memory: MongoDB, Exasol, Redis
      • On disk (single/sharded): MySQL, PostgreSQL
      • Data warehouse: Teradata, DB2, Oracle
      • Distributed: HDFS, Cassandra, Riak
      • Cloud: S3, Azure, GCE, OpenStack
  • 13. Databases 2
      • Good old OLTP
      • Analytic
      • Key-value stores
      • Document stores
      • HDFS
      • What is the best choice for this job?
  • 14. Data Structures and Algorithms
      • Transforming data is expensive --> play safe with data structures
      • Normalization dilemma
      • Algorithms must tolerate the volatile nature of data
          • Data drift, errors, missing values, outliers
      • Models need to be explanatory
      • Attention to complexity
          • The usual obvious (CPU, memory, disk scans & seeks)
          • Iterations
          • Model size: What is an example of this?
  • 15. Real-time Systems
      • What is real-time? Very different requirements:
      • Analyst: “What’s the user count today? By source? Now? From France?”
      • Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?”
      • Google: “Make a bid for these placements. You have 50 ms”
  • 16. User Interfaces
      • Operations or not, visualization is critical for acceptance
      • From business concept to implementation
      • What information do these users want to see?
      • How does this information support decision making?
      • How to visualize it with clarity yet powerfully?
  • 17. Significance Testing
      • Data-driven actions must be backed by numbers
      • Early analytics glossed over significance
      • Executive: “Can I trust these numbers? Is my decision justified?”
      • Systems must act conservatively
      • Trust is built slowly, but lost quickly
      • Data solutions must not screw up
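One common way to back a number such as "variant B converts better than A" is a two-proportion z-test. The sketch below is a generic textbook version using only the standard library. The talk does not name a specific test, so the choice of test, the function name, and the example counts are all assumptions.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Assumes large samples and a pooled rate strictly between 0 and 1.
    Returns (z, p_value).
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 100/1000 conversions for A, 150/1000 for B.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
```

With these made-up counts z is about 3.4 and the p-value is below 0.001, so the difference would be hard to explain by chance alone. A system that "acts conservatively" in the sense of slide 17 would refuse to act when the p-value is large or the samples are small.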
  • 18. Modeling Information Business Systems
      • Understanding business and how to improve it with data
      • Business problem --> data solution
      • The most important quality of a data scientist
  • 20. Hand-written Turing Machine vs Excel
      • Average business has tons of low-hanging data fruit
      • Developing and automating all that takes years (and years)
      • No use for “advanced” stuff without visibility to the underlying
      • There is no shortcut
      • The organization itself needs to mature
  • 21. Supervised vs Unsupervised
      • Decide purpose of analysis now or later?
      • Most often the need is already formulated
      • Here’s a standard clustering of human behavior
      • Power laws will screw things up
  • 22. Ad-hoc vs Operations
      • Operative data algorithms run day and night without supervision
      • Can produce massive leverage and ROI to a business
      • … but they are crazy hard to develop
      • Ad hoc analysis can employ all the cool stuff from last month’s JMLR
      • … but they can’t scale
      • … and 90 % of effort goes to communication and visualization
  • 24. State snapshots
      • User actions modify the current state in an OLTP
      • Single actions go to offline audit log for re-running
      • Data algorithms need to export and import data
      • Things are run in batches
      • What data used to be (and still often is)
      • [Diagram labels: Events, Snapshot, Snapshot, Snapshot]
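The relationship between an audit log of single actions and state snapshots can be sketched as a replay: applying the logged events in order to an earlier snapshot reproduces a later one. The event shape (`user`, `delta`) and the balance-style state are hypothetical, chosen only to keep the example small.

```python
def apply_event(state, event):
    """Return a new state with one logged user action applied."""
    new_state = dict(state)
    user = event["user"]
    new_state[user] = new_state.get(user, 0) + event["delta"]
    return new_state


def replay(snapshot, events):
    """Rebuild a later snapshot from an earlier one plus the events since."""
    state = dict(snapshot)
    for event in events:
        state = apply_event(state, event)
    return state


# Hypothetical audit log.
log = [
    {"user": "a", "delta": 5},
    {"user": "b", "delta": 2},
    {"user": "a", "delta": -1},
]

full = replay({}, log)                 # {'a': 4, 'b': 2}
midway = replay({}, log[:2])           # snapshot taken after two events
assert replay(midway, log[2:]) == full
```

The final assertion is the property batch systems rely on: a snapshot plus the events after it is equivalent to replaying the whole log, so old events can be archived once a snapshot exists.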
  • 25. Data Warehouse
      • Additional endpoint specialized for analytics
      • Can run surprisingly many algorithms
      • … because the speed is so worth the effort
  • 26. Cloud
      • “Scalable SOA for computation, networking and storage”
      • Really all about strict APIs
      • Service dog wagging the infrastructure tail
      • Public cloud very competitive for the small guys
      • Hybrid clouds increasingly replace enterprise systems
  • 27. Event data
      • Event stream itself becomes first class citizen and master-labeled
      • Needs novel storage
      • Needs novel processing
      • Data scientists beware! Sugar high imminent!
  • 28. Stream processing
      • New data is coming in all the time
      • Process it online
      • Data becomes somewhat disposable
      • “Why bother with month old data when there’s too much of it anyways?”
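"Process it online" with "disposable" data usually means keeping a small summary that is updated per item and then dropping the item. A minimal sketch using Welford's algorithm for a running mean and variance follows. The class name and sample values are illustrative assumptions.

```python
class RunningStats:
    """Running mean and sample variance in O(1) memory (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stand-in for an unbounded stream
    stats.push(value)                    # the value can be discarded afterwards
```

After the loop `stats.mean` is 5.0 and `stats.variance` is 32/7, the same as a batch computation over the stored values would give, without ever storing them.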
  • 29. Iterative processing
      • Always been the problem with large data
      • Keeping state in memory necessary, but hard
      • Spark doesn’t solve this, but makes it less painful
      • Common fix: don’t do iterations
  • 30. Hadoop the Hairy Framework
      • HDFS, ZooKeeper, MapReduce, Hive, Pig, Yarn, Flume, Mahout, Bigtop, Oozie, Hue, HCatalog, Avro, Whirr, Sqoop, Impala, DataFu, …
      • Premise of insanely large and/or unstructured data
      • You probably don’t need it
  • 31. Will Hadoop replace the Data Warehouse?
      • Separate concepts: Hadoop the Framework vs. MapReduce
      • MapReduce suited for totally different tasks
      • Hadoop can host a data warehouse
      • … but it won’t be any easier or quicker to develop
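To separate the MapReduce programming model from Hadoop the framework, here is a toy single-process word count with explicit map, shuffle, and reduce phases. It does not use Hadoop's API. All function names are illustrative, and the code only mimics the data flow that a real cluster would distribute across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big deal"])  # {'big': 2, 'data': 1, 'deal': 1}
```

The model fits batch jobs that can be expressed as independent per-record maps followed by per-key aggregation. Interactive warehouse queries and iterative algorithms fit it poorly, which is the distinction the slide points at.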
  • 33. What does Big Data mean for a business?
      • Answers … a lot more answers
      • Better, more reliable decision making
      • Treating customers as individuals instead of segments
      • How to design processes (both business and social) to employ data?
  • 35. Thank you!
      • Always eager to talk about this stuff, feel free to contact me!
      • Now it’s time for lots of questions!
      • niko.vuokko@gmail.com
      • linkedin.com/in/nikovuokko
      • @nikovuokko