Big Data Rampage!
NIKO VUOKKO
13 MAY 2013, HIIT SEMINAR
The data
About that data of yours…
• Researchers generally live in a nice utopia where data just works *
*Yes, you do munge it for days, that’s nice
Reality check
What if you suddenly notice that there’s
• … corrupted JSON/XML/whatever
• … corrupted ids
• … transient ids
• … 5 different transient ids
• … text in number fields
• … new fields
• … disappeared fields
• … fields whose meaning just changed
• … but you have no idea of the new definition
• … all of these, regularly, without advance notice
• … and the bad data is coming at you at 1 GB per hour
• … and yours or someone else’s business depends on the data
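A defensive ingestion layer is the usual first line of defense against all of the above. A minimal sketch in Python (the field names `amount` and `user_id` are hypothetical, for illustration only): skip records with corrupted JSON, coerce text found in number fields, and default fields that disappeared.

```python
import json

def parse_event(line):
    """Defensively parse one raw event; return None if unusable."""
    try:
        event = json.loads(line)
    except (json.JSONDecodeError, TypeError):
        return None  # corrupted JSON: drop it, count it, move on
    try:
        # text in a number field: coerce if possible
        event["amount"] = float(event["amount"])
    except (KeyError, TypeError, ValueError):
        event["amount"] = None
    # disappeared field: default instead of crashing downstream
    event.setdefault("user_id", None)
    return event

lines = [
    '{"user_id": "u1", "amount": "12.5"}',  # text in a number field
    'not json at all',                      # corrupted record
    '{"amount": 3}',                        # user_id disappeared
]
events = [e for e in (parse_event(l) for l in lines) if e is not None]
```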
(Diagram: garbage --> you --> great insights)
The data
• Enriched by many operationally attainable sources
--> varying schema and complicated ID soup
• Developed by frontline instead of IT waterfall
--> faster process, but volatile data definition
• Data scientists often require access to more data
--> further risks of lapses
• Big and streaming in
--> risks of discontinuity
The Big Data
PLEASE DON’T SHOOT ME FOR USING THE TERM
What is big?
Human-generated
• 5K tweets / s
• 25K events / s from a mobile game (that’s 200 GB / day)
• 40K Google searches / s
Machine-generated
• 5M quotes / s in the US options market
• 120 MB / s of diagnostics from a single gas turbine
• 1 PB / s peaking from CERN LHC
What will be big?
• Human-generated data will get more detailed
• … but won’t grow much faster than the userbase
• … so it will eventually look small next to machine data
• Machine-generated data will grow with Moore’s law
• … and it’s already massive
How many of you consider this scale?
• Why not?
• We already understand CPU and memory intensive problems
• But the new world out there is data intensive
• How can research stay in touch with change and stay relevant?
The Curriculum
RETROFITTING CS STUDIES
Software Architectures
• Single-thread performance and disk IO hitting a wall
• How do learning algorithms scale out of this corner?
• Stochastic methods
• Ensembles
• Online learning
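Online learning is the clearest escape from that corner: the model sees each example once and keeps only O(1) state. A minimal sketch, assuming plain least-squares SGD on a noiseless synthetic stream:

```python
import random

def sgd_linear(stream, lr=0.01):
    """Fit y ~ w*x + b one example at a time; only (w, b) is ever in memory."""
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y   # prediction error on this one example
        w -= lr * err * x       # stochastic gradient step
        b -= lr * err
    return w, b

random.seed(0)
# a stream generated from the (made-up) relation y = 2x + 1
stream = ((x, 2.0 * x + 1.0) for x in (random.uniform(-1, 1) for _ in range(20000)))
w, b = sgd_linear(stream)       # w -> ~2.0, b -> ~1.0
```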
Databases 1
• In memory: MongoDB, Exasol, Redis
• On disk (single/sharded): MySQL, PostgreSQL
• Data warehouse: Teradata, DB2, Oracle
• Distributed: HDFS, Cassandra, Riak
• Cloud: S3, Azure, GCE, OpenStack
Databases 2
• Good old OLTP
• Analytic
• Key-value stores
• Document stores
• HDFS
• What is the best choice for this job?
Data Structures and Algorithms
• Transforming data is expensive --> play safe with data structures
• Normalization dilemma
• Algorithms must tolerate the volatile nature of data
• Data drift, errors, missing values, outliers
• Models need to be explanatory
• Attention to complexity
• The usual obvious (CPU, memory, disk scans & seeks)
• Iterations
• Model size: What is an example of this?
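One way to tolerate drift and outliers is to replace mean/std with median/MAD, which a single corrupted value cannot drag around. A sketch (the 3.5 cutoff and the 1.4826 consistency constant are conventional choices, not from the slides):

```python
import statistics

def robust_zscores(values):
    """Median/MAD z-scores: one wild value cannot shift the baseline,
    unlike mean/std, which the outlier itself would inflate."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    # 1.4826 makes MAD consistent with std under normal data
    return [(v - med) / (1.4826 * mad) for v in values]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 500.0]  # one corrupted sensor reading
scores = robust_zscores(readings)
outliers = [v for v, z in zip(readings, scores) if abs(z) > 3.5]
```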
Real-time Systems
What is real-time?
Very different requirements:
• Analyst: “What’s the user count today? By source? Now? From France?”
• Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?”
• Google: “Make a bid for these placements. You have 50 ms”
User Interfaces
• Operations or not, visualization is critical for acceptance
• From business concept to implementation
• What information do these users want to see?
• How does this information support decision making?
• How to visualize it with clarity yet powerfully?
Significance Testing
• Data-driven actions must be backed by numbers
• Early analytics glossed over significance
• Executive: “Can I trust these numbers? Is my decision justified?”
• Systems must act conservatively
• Trust is built slowly, but lost quickly
• Data solutions must not screw up
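Backing a decision with numbers can be as simple as a two-proportion z-test on conversion counts. A stdlib-only sketch (the counts below are made up for illustration):

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test: are these two conversion rates really different?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value via the normal CDF, expressed with math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# illustrative numbers: 2.0% vs 2.6% conversion on 10k users per arm
z, p = two_proportion_ztest(200, 10_000, 260, 10_000)
significant = p < 0.05
```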
Modeling Information Business Systems
• Understanding business and how to improve it with data
• Business problem --> data solution
• The most important quality of a data scientist
Contrasts
Hand-written Turing Machine vs Excel
• Average business has tons of low-hanging data fruit
• Developing and automating all that takes years (and years)
• No use for “advanced” stuff without visibility into the underlying data
• There is no shortcut
• The organization itself needs to mature
Supervised vs Unsupervised
• Decide purpose of analysis now or later?
• Most often the need is already formulated
• Here’s a standard clustering of human behavior
• Power laws will screw things up
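To see why power laws screw things up, compare mean and median on Pareto-distributed activity counts: the mean lands far above the typical user, so distance-based clustering on raw counts lumps nearly everyone into one blob. A small demonstration (alpha=1.5 is an illustrative choice):

```python
import random

random.seed(42)
# Pareto(alpha=1.5): heavy-tailed, like most human activity counts
samples = sorted(random.paretovariate(1.5) for _ in range(100_000))
mean = sum(samples) / len(samples)
median = samples[len(samples) // 2]
# the mean sits well above the typical user, so Euclidean distances
# on raw counts are dominated by the tail; a log transform before
# clustering is the usual first aid
```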
Ad-hoc vs Operations
• Operative data algorithms run day and night without supervision
• Can produce massive leverage and ROI to a business
• … but they are crazy hard to develop
• Ad hoc analysis can employ all the cool stuff from last month’s JMLR
• … but they can’t scale
• … and 90% of effort goes to communication and visualization
Computation Models
State snapshots
• User actions modify the current state in an OLTP
• Single actions go to offline audit log for re-running
• Data algorithms need to export and import data
• Things are run in batches
• What data used to be (and still often is)
(Diagram: incoming events collapsed into periodic snapshots)
Data Warehouse
• Additional endpoint specialized for analytics
• Can run surprisingly many algorithms
• … because the speed is so worth the effort
Cloud
• “Scalable SOA for computation, networking and storage”
• Really all about strict APIs
• Service dog wagging the infrastructure tail
• Public cloud very competitive for the small guys
• Hybrid clouds increasingly replace enterprise systems
Event data
• Event stream itself becomes a first-class citizen and master-labeled
• Needs novel storage
• Needs novel processing
• Data scientists beware! Sugar high imminent!
Stream processing
• New data is coming in all the time
• Process it online
• Data becomes somewhat disposable
• “Why bother with month-old data when there’s too much of it anyway?”
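Processing online means keeping summaries, not raw history. Welford’s one-pass algorithm is the classic example: running mean and variance in O(1) memory, so the month-old raw data really can be thrown away.

```python
class RunningStats:
    """Welford's one-pass algorithm: mean and variance of a stream
    with O(1) memory; the raw events never need to be stored."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # numerically stable update

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(x)
# stats.mean is ~5.0, sample variance ~32/7
```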
Iterative processing
• Always been the problem with large data
• Keeping state in memory necessary, but hard
• Spark doesn’t solve this, but makes it less painful
• Common fix: don’t do iterations
Hadoop the Hairy Framework
• HDFS, ZooKeeper, MapReduce, Hive, Pig,
YARN, Flume, Mahout, Bigtop, Oozie, Hue,
HCatalog, Avro, Whirr, Sqoop, Impala, DataFu, …
• Premise of insanely large and/or unstructured data
• You probably don’t need it
Will Hadoop replace the Data Warehouse?
Separate concepts: Hadoop the Framework vs. MapReduce
• MapReduce suited for totally different tasks
• Hadoop can host a data warehouse
• … but it won’t be any easier or quicker to develop
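The MapReduce programming model itself fits in a few lines; the framework’s real work is the shuffle between the two phases. A local simulation of word count (plain Python, not Hadoop API code):

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    for word in record.split():
        yield word.lower(), 1      # mapper: emit (key, value) pairs

def reduce_phase(key, values):
    return key, sum(values)        # reducer: fold all values of one key

records = ["Big data rampage", "big data big pipes"]
# shuffle: group mapper output by key (this is the framework's real job)
groups = defaultdict(list)
for key, value in chain.from_iterable(map(map_phase, records)):
    groups[key].append(value)
counts = dict(reduce_phase(k, vs) for k, vs in groups.items())
# counts["big"] == 3, counts["data"] == 2
```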
The Purpose
What does Big Data mean for a business?
• Answers … a lot more answers
• Better, more reliable decision making
• Treating customers as individuals instead of segments
• How to design processes (both business and social) to employ data?
Data-driven decision making
Thank you!
• Always eager to talk about this stuff, feel free to contact!
• Now it’s time for lots of questions!
• niko.vuokko@gmail.com
• linkedin.com/in/nikovuokko
• @nikovuokko