Big Data Rampage!
NIKO VUOKKO
13 MAY 2013, HIIT SEMINAR
The data
About that data of yours…
• Researchers generally live in a nice utopia where data just works *
*Yes, you do munge it for days, that’s nice
Reality check
What if you suddenly notice that there’s
• … corrupted JSON/XML/whatever
• … corrupted ids
• … transient ids
• … 5 different transient ids
• … text in number fields
• … new fields
• … disappeared fields
• … fields whose meaning just changed
• … but you have no idea of the new definition
• … all of these, regularly, without advance notice
• … and the bad data is coming at you at 1 GB per hour
• … and yours or someone else’s business depends on the data
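A defensive ingestion layer is the usual first line of defense against all of the above. A minimal sketch in Python (the field names `amount` and `user_id` are hypothetical, for illustration only): skip records with corrupted JSON, coerce text found in number fields, and default fields that disappeared.

```python
import json

def parse_event(line):
    """Defensively parse one raw event; return None if unusable."""
    try:
        event = json.loads(line)
    except (json.JSONDecodeError, TypeError):
        return None  # corrupted JSON: drop it, count it, move on
    try:
        # text in a number field: coerce if possible
        event["amount"] = float(event["amount"])
    except (KeyError, TypeError, ValueError):
        event["amount"] = None
    # disappeared field: default instead of crashing downstream
    event.setdefault("user_id", None)
    return event

lines = [
    '{"user_id": "u1", "amount": "12.5"}',  # text in a number field
    'not json at all',                      # corrupted record
    '{"amount": 3}',                        # user_id disappeared
]
events = [e for e in (parse_event(l) for l in lines) if e is not None]
```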
(Diagram: garbage --> you --> great insights)
The data
• Enriched by many operationally attainable sources
--> varying schema and complicated ID soup
• Developed by frontline instead of IT waterfall
--> faster process, but volatile data definition
• Data scientists often require access to more data
--> further risks of lapses
• Big and streaming in
--> risks of discontinuity
The Big Data
PLEASE DON’T SHOOT ME FOR USING THE TERM
What is big?
Human-generated
• 5K tweets / s
• 25K events / s from a mobile game (that’s 200 GB / day)
• 40K Google searches / s
Machine-generated
• 5M quotes / s in the US options market
• 120 MB / s of diagnostics from a single gas turbine
• 1 PB / s peaking from CERN LHC
What will be big?
• Human-generated data will get more detailed
• … but won’t grow much faster than the userbase
• … so it will eventually look small next to machine data
• Machine-generated data will grow with Moore’s law
• … and it’s already massive
How many of you consider this scale?
• Why not?
• We already understand CPU and memory intensive problems
• But the new world out there is data intensive
• How can research stay in touch with change and stay relevant?
The Curriculum
RETROFITTING CS STUDIES
Software Architectures
• Single-thread performance and disk IO hitting a wall
• How do learning algorithms scale out of this corner?
• Stochastic methods
• Ensembles
• Online learning
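Online learning is the clearest escape from that corner: the model sees each example once and keeps only O(1) state. A minimal sketch, assuming plain least-squares SGD on a noiseless synthetic stream:

```python
import random

def sgd_linear(stream, lr=0.01):
    """Fit y ~ w*x + b one example at a time; only (w, b) is ever in memory."""
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y   # prediction error on this one example
        w -= lr * err * x       # stochastic gradient step
        b -= lr * err
    return w, b

random.seed(0)
# a stream generated from the (made-up) relation y = 2x + 1
stream = ((x, 2.0 * x + 1.0) for x in (random.uniform(-1, 1) for _ in range(20000)))
w, b = sgd_linear(stream)       # w -> ~2.0, b -> ~1.0
```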
Databases 1
• In memory: MongoDB, Exasol, Redis
• On disk (single/sharded): MySQL, PostgreSQL
• Data warehouse: Teradata, DB2, Oracle
• Distributed: HDFS, Cassandra, Riak
• Cloud: S3, Azure, GCE, OpenStack
Databases 2
• Good old OLTP
• Analytic
• Key-value stores
• Document stores
• HDFS
• What is the best choice for this job?
Data Structures and Algorithms
• Transforming data is expensive --> play safe with data structures
• Normalization dilemma
• Algorithms must tolerate the volatile nature of data
• Data drift, errors, missing values, outliers
• Models need to be explanatory
• Attention to complexity
• The usual obvious (CPU, memory, disk scans & seeks)
• Iterations
• Model size: What is an example of this?
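One way to tolerate drift and outliers is to replace mean/std with median/MAD, which a single corrupted value cannot drag around. A sketch (the 3.5 cutoff and the 1.4826 consistency constant are conventional choices, not from the slides):

```python
import statistics

def robust_zscores(values):
    """Median/MAD z-scores: one wild value cannot shift the baseline,
    unlike mean/std, which the outlier itself would inflate."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    # 1.4826 makes MAD consistent with std under normal data
    return [(v - med) / (1.4826 * mad) for v in values]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 500.0]  # one corrupted sensor reading
scores = robust_zscores(readings)
outliers = [v for v, z in zip(readings, scores) if abs(z) > 3.5]
```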
Real-time Systems
What is real-time?
Very different requirements:
• Analyst: “What’s the user count today? By source? Now? From France?”
• Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?”
• Google: “Make a bid for these placements. You have 50 ms”
User Interfaces
• Operations or not, visualization is critical for acceptance
• From business concept to implementation
• What information do these users want to see?
• How does this information support decision making?
• How to visualize it with clarity yet powerfully?
Significance Testing
• Data-driven actions must be backed by numbers
• Early analytics glossed over significance
• Executive: “Can I trust these numbers? Is my decision justified?”
• Systems must act conservatively
• Trust is built slowly, but lost quickly
• Data solutions must not screw up
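Backing a decision with numbers can be as simple as a two-proportion z-test on conversion counts. A stdlib-only sketch (the counts below are made up for illustration):

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test: are these two conversion rates really different?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value via the normal CDF, expressed with math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# illustrative numbers: 2.0% vs 2.6% conversion on 10k users per arm
z, p = two_proportion_ztest(200, 10_000, 260, 10_000)
significant = p < 0.05
```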
Modeling Information Business Systems
• Understanding business and how to improve it with data
• Business problem --> data solution
• The most important quality of a data scientist
Contrasts
Hand-written Turing Machine vs Excel
• Average business has tons of low-hanging data fruit
• Developing and automating all that takes years (and years)
• No use for “advanced” stuff without visibility into the underlying data
• There is no shortcut
• The organization itself needs to mature
Supervised vs Unsupervised
• Decide purpose of analysis now or later?
• Most often the need is already formulated
• Here’s a standard clustering of human behavior
• Power laws will screw things up
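To see why power laws screw things up, compare mean and median on Pareto-distributed activity counts: the mean lands far above the typical user, so distance-based clustering on raw counts lumps nearly everyone into one blob. A small demonstration (alpha=1.5 is an illustrative choice):

```python
import random

random.seed(42)
# Pareto(alpha=1.5): heavy-tailed, like most human activity counts
samples = sorted(random.paretovariate(1.5) for _ in range(100_000))
mean = sum(samples) / len(samples)
median = samples[len(samples) // 2]
# the mean sits well above the typical user, so Euclidean distances
# on raw counts are dominated by the tail; a log transform before
# clustering is the usual first aid
```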
Ad-hoc vs Operations
• Operative data algorithms run day and night without supervision
• Can produce massive leverage and ROI to a business
• … but they are crazy hard to develop
• Ad hoc analysis can employ all the cool stuff from last month’s JMLR
• … but they can’t scale
• … and 90% of effort goes to communication and visualization
Computation Models
State snapshots
• User actions modify the current state in an OLTP
• Single actions go to offline audit log for re-running
• Data algorithms need to export and import data
• Things are run in batches
• What data used to be (and still often is)
(Diagram: incoming events collapsed into periodic snapshots)
Data Warehouse
• Additional endpoint specialized for analytics
• Can run surprisingly many algorithms
• … because the speed is so worth the effort
Cloud
• “Scalable SOA for computation, networking and storage”
• Really all about strict APIs
• Service dog wagging the infrastructure tail
• Public cloud very competitive for the small guys
• Hybrid clouds increasingly replace enterprise systems
Event data
• Event stream itself becomes a first-class citizen and master-labeled
• Needs novel storage
• Needs novel processing
• Data scientists beware! Sugar high imminent!
Stream processing
• New data is coming in all the time
• Process it online
• Data becomes somewhat disposable
• “Why bother with month-old data when there’s too much of it anyway?”
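Processing online means keeping summaries, not raw history. Welford’s one-pass algorithm is the classic example: running mean and variance in O(1) memory, so the month-old raw data really can be thrown away.

```python
class RunningStats:
    """Welford's one-pass algorithm: mean and variance of a stream
    with O(1) memory; the raw events never need to be stored."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # numerically stable update

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(x)
# stats.mean is ~5.0, sample variance ~32/7
```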
Iterative processing
• Always been the problem with large data
• Keeping state in memory necessary, but hard
• Spark doesn’t solve this, but makes it less painful
• Common fix: don’t do iterations
Hadoop the Hairy Framework
• HDFS, ZooKeeper, MapReduce, Hive, Pig,
YARN, Flume, Mahout, Bigtop, Oozie, Hue,
HCatalog, Avro, Whirr, Sqoop, Impala, DataFu, …
• Premise of insanely large and/or unstructured data
• You probably don’t need it
Will Hadoop replace the Data Warehouse?
Separate concepts: Hadoop the Framework vs. MapReduce
• MapReduce suited for totally different tasks
• Hadoop can host a data warehouse
• … but it won’t be any easier or quicker to develop
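The MapReduce programming model itself fits in a few lines; the framework’s real work is the shuffle between the two phases. A local simulation of word count (plain Python, not Hadoop API code):

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    for word in record.split():
        yield word.lower(), 1      # mapper: emit (key, value) pairs

def reduce_phase(key, values):
    return key, sum(values)        # reducer: fold all values of one key

records = ["Big data rampage", "big data big pipes"]
# shuffle: group mapper output by key (this is the framework's real job)
groups = defaultdict(list)
for key, value in chain.from_iterable(map(map_phase, records)):
    groups[key].append(value)
counts = dict(reduce_phase(k, vs) for k, vs in groups.items())
# counts["big"] == 3, counts["data"] == 2
```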
The Purpose
What does Big Data mean for a business?
• Answers … a lot more answers
• Better, more reliable decision making
• Treating customers as individuals instead of segments
• How to design processes (both business and social) to employ data?
Data-driven decision making
Thank you!
• Always eager to talk about this stuff, feel free to contact!
• Now it’s time for lots of questions!
• niko.vuokko@gmail.com
• linkedin.com/in/nikovuokko
• @nikovuokko