Big Data Rampage! Niko Vuokko, 13 May 2013, HIIT Seminar
Presentation for the 45 min. + QA talk I gave at HIIT seminar on 13 May 2013 for local data science researchers.

Published in: Technology, Business
Big Data Rampage

  1. Big Data Rampage! Niko Vuokko, 13 May 2013, HIIT Seminar
  2. The data
  3. About that data of yours…
     • Researchers generally live in a nice utopia where data just works*
     • *Yes, you do munge it for days, that’s nice
     • Reality check
  4. What if you suddenly notice that there’s
     • … corrupted JSON/XML/whatever
     • … corrupted ids
     • … transient ids
     • … 5 different transient ids
     • … text in number fields
     • … new fields
     • … disappeared fields
     • … fields whose meaning just changed
     • … but you have no idea of the new definition
     • … all of these, regularly, without advance notice
     • … and the bad data is coming at you at 1 GB per hour
     • … and yours or someone else’s business depends on the data
     [Diagram labels: You, Garbage, Great insights]
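One way to survive several of the failure modes listed on this slide (corrupted JSON, text in number fields, new and disappeared fields) is to validate every record at the door and report problems instead of crashing. The sketch below is only an illustration: the `EXPECTED_FIELDS` schema, the field names, and the `parse_event` helper are hypothetical and not from the talk.

```python
import json
from typing import Dict, List, Optional, Tuple

# Hypothetical schema (an assumption for this sketch): field name -> expected type.
EXPECTED_FIELDS = {"user_id": str, "amount": float}


def parse_event(line: str) -> Tuple[Optional[Dict], List[str]]:
    """Parse one raw line; return (record or None, list of detected problems)."""
    try:
        raw = json.loads(line)
    except json.JSONDecodeError:
        return None, ["corrupted JSON"]
    if not isinstance(raw, dict):
        return None, ["not a JSON object"]

    problems: List[str] = []
    record: Dict = {}
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in raw:
            problems.append(f"missing field: {name}")
            continue
        try:
            # Coerce, so that "12.5" in a number field still works but "twelve" does not.
            record[name] = expected_type(raw[name])
        except (TypeError, ValueError):
            problems.append(f"bad value in field: {name}")

    # New fields are reported but do not invalidate the record.
    for name in sorted(set(raw) - set(EXPECTED_FIELDS)):
        problems.append(f"unexpected new field: {name}")

    complete = len(record) == len(EXPECTED_FIELDS)
    return (record if complete else None), problems
```

Counting the returned problems per hour gives an early warning when an upstream definition changes. A field whose meaning silently changed cannot be caught this way, since its values still parse.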
  5. The data
     • Enriched by many operationally attainable sources --> varying schema and complicated ID soup
     • Developed by frontline instead of IT waterfall --> faster process, but volatile data definition
     • Data scientists often require access to more data --> further risks of lapses
     • Big and streaming in --> risks of discontinuity
  6. The Big Data (please don’t shoot me for using the term)
  7. What is big?
     Human-generated
     • 5K tweets / s
     • 25K events / s from a mobile game (that’s 200 GB / day)
     • 40K Google searches / s
     Machine-generated
     • 5M quotes / s in the US options market
     • 120 MB / s of diagnostics from a single gas turbine
     • 1 PB / s peaking from CERN LHC
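The mobile game figures on this slide can be sanity-checked with a little arithmetic. The sketch below assumes decimal gigabytes (10^9 bytes) and a constant event rate over the whole day; neither assumption is stated in the talk.

```python
EVENTS_PER_SECOND = 25_000          # from the slide
BYTES_PER_DAY = 200 * 10**9         # 200 GB / day, assuming decimal gigabytes
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400

# Assuming the rate is constant all day: 25,000 * 86,400 = 2,160,000,000 events.
events_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY

# Implied average event size: roughly 93 bytes per event.
bytes_per_event = BYTES_PER_DAY / events_per_day

print(f"{events_per_day:,} events/day, {bytes_per_event:.1f} bytes/event")
```

Under these assumptions each event averages under 100 bytes, so the volume comes from the event count rather than from large payloads.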
  8. What will be big?
     • Human-generated data will get more detailed
     • … but won’t grow much faster than the userbase
     • It will become small eventually
     • Machine-generated data will grow by Moore’s law
     • … and it’s already massive
  9. How many of you consider this scale?
     • Why not?
     • We already understand CPU and memory intensive problems
     • But the new world out there is data intensive
     • How can research stay in touch with change and stay relevant?
  10. The Curriculum: retrofitting CS studies
  11. Software Architectures
     • Single thread performance and disk IO hitting a wall
     • How do learning algorithms scale out of this corner?
     • Stochastic methods
     • Ensembles
     • Online learning
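To make "stochastic methods" and "online learning" concrete, here is a minimal sketch of logistic regression trained by stochastic gradient descent, one example at a time, so the full data set never has to sit in memory. The class name, learning rate, and feature layout are assumptions made for illustration, not something the talk specifies.

```python
import math
from typing import List, Sequence


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time (plain SGD)."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x: Sequence[float]) -> float:
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp so exp() cannot overflow
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x: Sequence[float], y: int) -> None:
        """One SGD step on the log loss for a single (x, y) pair, y in {0, 1}."""
        error = self.predict_proba(x) - y  # gradient of the log loss w.r.t. z
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error
```

Each call to `update` touches only the current example, so the memory cost is the model itself, and the same loop works whether examples come from a file or a live stream.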
  12. Databases 1
     • In memory: MongoDB, Exasol, Redis
     • On disk (single/sharded): MySQL, PostgreSQL
     • On data warehouse: Teradata, DB2, Oracle
     • Distributed: HDFS, Cassandra, Riak
     • Cloud: S3, Azure, GCE, OpenStack
  13. Databases 2
     • Good old OLTP
     • Analytic
     • Key-value stores
     • Document stores
     • HDFS
     • What is the best choice for this job?
  14. Data Structures and Algorithms
     • Transforming data is expensive --> play safe with data structures
     • Normalization dilemma
     • Algorithms must tolerate the volatile nature of data
     • Data drift, errors, missing values, outliers
     • Models need to be explanatory
     • Attention to complexity
     • The usual obvious (CPU, memory, disk scans & seeks)
     • Iterations
     • Model size: What is an example of this?
  15. Real-time Systems
     What is real-time? Very different requirements:
     • Analyst: “What’s the user count today? By source? Now? From France?”
     • Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?”
     • Google: “Make a bid for these placements. You have 50 ms”
  16. User Interfaces
     • Operations or not, visualization is critical for acceptance
     • From business concept to implementation
     • What information do these users want to see?
     • How does this information support decision making?
     • How to visualize it with clarity yet powerfully?
  17. Significance Testing
     • Data-driven actions must be backed by numbers
     • Early analytics glossed over significance
     • Executive: “Can I trust these numbers? Is my decision justified?”
     • Systems must act conservatively
     • Trust is built slowly, but lost quickly
     • Data solutions must not screw up
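The executive's question, "Can I trust these numbers?", often reduces to whether two observed rates really differ. As one possible illustration, the sketch below runs a standard pooled two-proportion z-test using only the standard library. The talk does not prescribe this particular test, and the function and parameter names are made up for the example.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    successes_a: int, trials_a: int, successes_b: int, trials_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both groups share one success rate."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    variance = pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b)
    if variance == 0.0:
        raise ValueError("test is undefined when the pooled rate is 0 or 1")
    z = (p_b - p_a) / math.sqrt(variance)
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

For example, `two_proportion_z_test(100, 1000, 200, 1000)` compares a 10 % rate with a 20 % rate. The normal approximation behind the test is unreliable for very small samples, so a system that must act conservatively should refuse to decide when counts are low.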
  18. Modeling Information Business Systems
     • Understanding business and how to improve it with data
     • Business problem → data solution
     • The most important quality of a data scientist
  19. Contrasts
  20. Hand-written Turing Machine vs Excel
     • Average business has tons of low-hanging data fruit
     • Developing and automating all that takes years (and years)
     • No use for “advanced” stuff without visibility to the underlying
     • There is no shortcut
     • The organization itself needs to mature
  21. Supervised vs Unsupervised
     • Decide purpose of analysis now or later?
     • Most often the need is already formulated
     • Here’s a standard clustering of human behavior
     • Power laws will screw things up
  22. Ad-hoc vs Operations
     • Operative data algorithms run day and night without supervision
     • Can produce massive leverage and ROI to a business
     • … but they are crazy hard to develop
     • Ad hoc analysis can employ all the cool stuff from last month’s JMLR
     • … but they can’t scale
     • … and 90 % of effort goes to communication and visualization
  23. Computation Models
  24. State snapshots
     • User actions modify the current state in an OLTP
     • Single actions go to offline audit log for re-running
     • Data algorithms need to export and import data
     • Things are run in batches
     • What data used to be (and still often is)
     [Diagram labels: Events, Snapshot, Snapshot, Snapshot]
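The relationship between the audit log and the snapshots can be sketched as a fold: replaying events on top of a snapshot yields the next snapshot. The event shape `(user_id, action, amount)` and the action names below are hypothetical, chosen only to keep the example small.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical event shape for this sketch: (user_id, action, amount).
Event = Tuple[str, str, int]


def apply_event(state: Dict[str, int], event: Event) -> None:
    """Mutate the current state in place, as an OLTP write would."""
    user_id, action, amount = event
    balance = state.get(user_id, 0)
    if action == "deposit":
        state[user_id] = balance + amount
    elif action == "withdraw":
        state[user_id] = balance - amount
    else:
        raise ValueError(f"unknown action: {action}")


def replay(
    events: Iterable[Event], snapshot: Optional[Dict[str, int]] = None
) -> Dict[str, int]:
    """Rebuild state by re-running logged events on top of an optional snapshot."""
    state = dict(snapshot or {})  # copy, so the stored snapshot is left untouched
    for event in events:
        apply_event(state, event)
    return state
```

Replaying the whole log from scratch gives the same state as replaying only the tail of the log on top of an earlier snapshot, which is what makes periodic batch exports workable.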
  25. Data Warehouse
     • Additional endpoint specialized for analytics
     • Can run surprisingly many algorithms
     • … because the speed is so worth the effort
  26. Cloud
     • “Scalable SOA for computation, networking and storage”
     • Really all about strict APIs
     • Service dog wagging the infrastructure tail
     • Public cloud very competitive for the small guys
     • Hybrid clouds increasingly replace enterprise systems
  27. Event data
     • Event stream itself becomes first class citizen and master-labeled
     • Needs novel storage
     • Needs novel processing
     • Data scientists beware! Sugar high imminent!
  28. Stream processing
     • New data is coming in all the time
     • Process it online
     • Data becomes somewhat disposable
     • “Why bother with month-old data when there’s too much of it anyway?”
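"Process it online" and "data becomes somewhat disposable" can be illustrated with Welford's algorithm, which keeps a running mean and variance while storing only three numbers. This is a generic textbook technique picked as an example, not something the talk names, and the `RunningStats` class is a hypothetical helper.

```python
class RunningStats:
    """Running mean and sample variance over a stream (Welford's algorithm)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, value: float) -> None:
        """Fold one new observation into the summary; the value can then be dropped."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; defined as 0.0 until at least two values have arrived."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0
```

After each `push` the raw value is no longer needed, so the stream can be discarded as it is consumed. The trade-off is that any statistic not tracked up front cannot be recovered later.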
  29. Iterative processing
     • Always been the problem with large data
     • Keeping state in memory necessary, but hard
     • Spark doesn’t solve this, but makes it less painful
     • Common fix: don’t do iterations
  30. Hadoop the Hairy Framework
     • HDFS, ZooKeeper, MapReduce, Hive, Pig, YARN, Flume, Mahout, Bigtop, Oozie, Hue, HCatalog, Avro, Whirr, Sqoop, Impala, DataFu, …
     • Premise of insanely large and/or unstructured data
     • You probably don’t need it
  31. Will Hadoop replace the Data Warehouse?
     Separate concepts: Hadoop the Framework vs. MapReduce
     • MapReduce suited for totally different tasks
     • Hadoop can host a data warehouse
     • … but it won’t be any easier or quicker to develop
  32. The Purpose
  33. What does Big Data mean for a business?
     • Answers … a lot more answers
     • Better, more reliable decision making
     • Treating customers as individuals instead of segments
     • How to design processes (both business and social) to employ data?
  34. Data-driven decision making
  35. Thank you!
     • Always eager to talk about this stuff, feel free to contact!
     • Now it’s time for lots of questions!
     • niko.vuokko@gmail.com
     • linkedin.com/in/nikovuokko
     • @nikovuokko
