
Big Data Rampage



Presentation for the 45 min. + QA talk I gave at HIIT seminar on 13 May 2013 for local data science researchers.





  • 1. Big Data Rampage! Niko Vuokko, 13 May 2013, HIIT seminar
  • 2. The data
  • 3. About that data of yours… Reality check
      • Researchers generally live in a nice utopia where data just works*
      • * Yes, you do munge it for days, that's nice
  • 4. What if you suddenly notice that there's
      • … corrupted JSON/XML/whatever
      • … corrupted ids
      • … transient ids
      • … 5 different transient ids
      • … text in number fields
      • … new fields
      • … disappeared fields
      • … fields whose meaning just changed
      • … but you have no idea of the new definition
      • … all of these, regularly, without forward notice
      • … and the bad data is coming at you at 1 GB per hour
      • … and your or someone else's business depends on the data
      Garbage → you → great insights
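A defensive ingestion layer can catch most of these failure modes before they poison downstream analytics. A minimal Python sketch, where the field names `id` and `value` are hypothetical stand-ins for whatever your schema uses:

```python
import json

def parse_event(raw):
    """Parse one raw event line defensively; return (event, error)."""
    try:
        event = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None, "corrupt_json"
    # Coerce numeric fields that sometimes arrive as text.
    value = event.get("value")
    if isinstance(value, str):
        try:
            event["value"] = float(value)
        except ValueError:
            return None, "bad_number"
    # Reject events missing a usable id.
    if not event.get("id"):
        return None, "missing_id"
    return event, None

lines = ['{"id": "a1", "value": "3.5"}',   # number as text -> coerced
         '{"value": 2}',                   # missing id -> rejected
         '{broken']                        # corrupt JSON -> rejected
parsed = [parse_event(line) for line in lines]
```

Counting the rejection reasons per hour is also a cheap way to notice "fields whose meaning just changed" before your users do.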
  • 5. The data
      • Enriched by many operationally attainable sources --> varying schema and complicated ID soup
      • Developed by the frontline instead of IT waterfall --> faster process, but volatile data definitions
      • Data scientists often require access to more data --> further risk of lapses
      • Big and streaming in --> risk of discontinuity
  • 7. What is big?
      Human-generated:
      • 5K tweets / s
      • 25K events / s from a mobile game (that's 200 GB / day)
      • 40K Google searches / s
      Machine-generated:
      • 5M quotes / s in the US options market
      • 120 MB / s of diagnostics from a single gas turbine
      • 1 PB / s peaking from the CERN LHC
  • 8. What will be big?
      • Human-generated data will get more detailed
      • … but won't grow much faster than the user base
      • It will become small eventually
      • Machine-generated data will grow by Moore's law
      • … and it's already massive
  • 9. How many of you consider this scale?
      • Why not?
      • We already understand CPU- and memory-intensive problems
      • But the new world out there is data-intensive
      • How can research stay in touch with change and stay relevant?
  • 10. The Curriculum: Retrofitting CS studies
  • 11. Software Architectures
      • Single-thread performance and disk IO hitting a wall
      • How do learning algorithms scale out of this corner?
      • Stochastic methods
      • Ensembles
      • Online learning
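One way learning scales out of that corner is online learning: a single pass over the stream, constant memory, one gradient step per example. A toy sketch of stochastic gradient descent for logistic regression; the learning rate and the synthetic stream are illustrative choices, not anything from the talk:

```python
import math
import random

def sgd_logistic(stream, dim, lr=0.1):
    """Single-pass logistic regression: one gradient step per example."""
    w = [0.0] * dim
    for x, y in stream:                               # y in {0, 1}
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))                # predicted P(y=1)
        g = p - y                                     # d(log loss)/dz
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

random.seed(0)
# Toy stream: label is 1 exactly when the first feature is positive;
# the constant 1.0 second feature acts as a bias term.
stream = [((x, 1.0), 1 if x > 0 else 0)
          for x in (random.uniform(-1, 1) for _ in range(2000))]
w = sgd_logistic(stream, dim=2)
```

Because the model never holds more than one example in memory, the same loop handles a gigabyte or a terabyte of events.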
  • 12. Databases 1
      • In memory: MongoDB, Exasol, Redis
      • On disk (single/sharded): MySQL, PostgreSQL
      • In a data warehouse: Teradata, DB2, Oracle
      • Distributed: HDFS, Cassandra, Riak
      • Cloud: S3, Azure, GCE, OpenStack
  • 13. Databases 2
      • Good old OLTP
      • Analytic
      • Key-value stores
      • Document stores
      • HDFS
      • What is the best choice for this job?
  • 14. Data Structures and Algorithms
      • Transforming data is expensive --> play safe with data structures
      • Normalization dilemma
      • Algorithms must tolerate the volatile nature of data
      • Data drift, errors, missing values, outliers
      • Models need to be explanatory
      • Attention to complexity
      • The usual obvious (CPU, memory, disk scans & seeks)
      • Iterations
      • Model size: what is an example of this?
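Tolerating drift and outliers in a stream calls for one-pass statistics. A sketch using Welford's online mean/variance as a cheap streaming outlier flag; the 3-sigma threshold is an illustrative choice:

```python
import math

class RunningStats:
    """Welford's online mean/variance: one pass, O(1) memory."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def push(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(x)

# Flag a new value as an outlier if it sits far from the running mean.
def is_outlier(x):
    return abs(x - stats.mean) > 3 * stats.std()
```

Unlike a two-pass variance computation, this never needs the data to sit still, which is the point when 1 GB arrives every hour.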
  • 15. Real-time Systems
      What is real-time? Very different requirements:
      • Analyst: "What's the user count today? By source? Now? From France?"
      • Sysadmin: "Network traffic up 5x in 5 seconds! What's going on?"
      • Google: "Make a bid for these placements. You have 50 ms"
  • 16. User Interfaces
      • Operations or not, visualization is critical for acceptance
      • From business concept to implementation
      • What information do these users want to see?
      • How does this information support decision making?
      • How to visualize it with clarity yet powerfully?
  • 17. Significance Testing
      • Data-driven actions must be backed by numbers
      • Early analytics glossed over significance
      • Executive: "Can I trust these numbers? Is my decision justified?"
      • Systems must act conservatively
      • Trust is built slowly, but lost quickly
      • Data solutions must not screw up
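A concrete instance of "backed by numbers": a pooled two-proportion z-test answering "did the conversion rate really improve?" The traffic figures below are made up for illustration:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for comparing two conversion rates (pooled variance)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)          # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # standard error
    return (p_a - p_b) / se

# 5.0% vs 4.0% conversion, 10,000 users per arm: is the lift real?
z = two_proportion_z(500, 10000, 400, 10000)
```

Here z is about 3.4, well past the usual 1.96 threshold, so the executive's "is my decision justified?" gets a defensible yes; with 100 users per arm the same rates would not come close.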
  • 18. Modeling Information Business Systems
      • Understanding business and how to improve it with data
      • Business problem → data solution
      • The most important quality of a data scientist
  • 19. Contrasts
  • 20. Hand-written Turing Machine vs Excel
      • The average business has tons of low-hanging data fruit
      • Developing and automating all that takes years (and years)
      • No use for "advanced" stuff without visibility into the underlying
      • There is no shortcut
      • The organization itself needs to mature
  • 21. Supervised vs Unsupervised
      • Decide the purpose of the analysis now or later?
      • Most often the need is already formulated
      • "Here's a standard clustering of human behavior"
      • Power laws will screw things up
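The power-law point is easy to see with toy numbers: when user activity is heavy-tailed, the mean (and hence any centroid a standard clustering produces) lands nowhere near a typical user. The counts below are invented for illustration:

```python
# Heavy-tailed "events per user": most users do almost nothing,
# one whale dominates the total.
events_per_user = [1] * 90 + [10] * 9 + [10000]

mean = sum(events_per_user) / len(events_per_user)       # pulled up by the whale
median = sorted(events_per_user)[len(events_per_user) // 2]
```

The mean is 101.8 while the median user generated a single event, so a "cluster center" computed on raw counts describes nobody; log-scaling or rank statistics are the usual escape hatches.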
  • 22. Ad-hoc vs Operations
      • Operative data algorithms run day and night without supervision
      • They can produce massive leverage and ROI for a business
      • … but they are crazy hard to develop
      • Ad hoc analysis can employ all the cool stuff from last month's JMLR
      • … but it can't scale
      • … and 90 % of the effort goes to communication and visualization
  • 23. Computation Models
  • 24. State snapshots
      • User actions modify the current state in an OLTP database
      • Single actions go to an offline audit log for re-running
      • Data algorithms need to export and import data
      • Things are run in batches
      • What data used to be (and still often is)
      Events → snapshot → snapshot → snapshot
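The snapshot-plus-audit-log model is essentially a fold over events: re-running the log from the start reproduces any snapshot. A minimal sketch, where the per-user balance schema is invented for illustration:

```python
def apply(state, event):
    """Fold one audit-log event into the state snapshot."""
    user = event["user"]
    state[user] = state.get(user, 0) + event["amount"]
    return state

log = [{"user": "a", "amount": 5},
       {"user": "b", "amount": 3},
       {"user": "a", "amount": -2}]

# A snapshot is just the fold of the log so far; replaying the
# log after a schema fix or a bug rebuilds the state from scratch.
snapshot = {}
for event in log:
    snapshot = apply(snapshot, event)
```

This re-runnability is exactly what the offline audit log buys: batch jobs can be repeated after failures without touching the live OLTP state.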
  • 25. Data Warehouse
      • An additional endpoint specialized for analytics
      • Can run surprisingly many algorithms
      • … because the speed is so worth the effort
  • 26. Cloud
      • "Scalable SOA for computation, networking and storage"
      • Really all about strict APIs
      • The service dog wagging the infrastructure tail
      • Public cloud very competitive for the small guys
      • Hybrid clouds increasingly replace enterprise systems
  • 27. Event data
      • The event stream itself becomes a first-class citizen and the master record
      • Needs novel storage
      • Needs novel processing
      • Data scientists beware! Sugar high imminent!
  • 28. Stream processing
      • New data is coming in all the time
      • Process it online
      • Data becomes somewhat disposable
      • "Why bother with month-old data when there's too much of it anyway?"
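"Process it online" typically means windowed state: keep only recent events and let the rest expire, which is where the disposability comes from. A sketch of a sliding time-window counter; the 60-second window and the timestamps are arbitrary choices:

```python
from collections import deque

class WindowCounter:
    """Count events seen in the last `window` seconds of stream time."""
    def __init__(self, window):
        self.window = window
        self.times = deque()          # timestamps, oldest first

    def add(self, t):
        self.times.append(t)
        self._evict(t)

    def count(self, now):
        self._evict(now)
        return len(self.times)

    def _evict(self, now):
        # Drop timestamps that have fallen out of the window.
        while self.times and self.times[0] <= now - self.window:
            self.times.popleft()

c = WindowCounter(window=60)
for t in [0, 10, 50, 65, 70]:
    c.add(t)
```

Memory is bounded by the window size rather than the stream length, which is the property that makes the sysadmin's "traffic up 5x in 5 seconds" alert feasible at all.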
  • 29. Iterative processing
      • Has always been the problem with large data
      • Keeping state in memory is necessary, but hard
      • Spark doesn't solve this, but makes it less painful
      • Common fix: don't do iterations
  • 30. Hadoop the Hairy Framework
      • HDFS, ZooKeeper, MapReduce, Hive, Pig, Yarn, Flume, Mahout, Bigtop, Oozie, Hue, HCatalog, Avro, Whirr, Sqoop, Impala, DataFu, …
      • Premise of insanely large and/or unstructured data
      • You probably don't need it
  • 31. Will Hadoop replace the Data Warehouse?
      • Separate concepts: Hadoop the framework vs. MapReduce
      • MapReduce is suited for totally different tasks
      • Hadoop can host a data warehouse
      • … but it won't be any easier or quicker to develop
  • 32. The Purpose
  • 33. What does Big Data mean for a business?
      • Answers … a lot more answers
      • Better, more reliable decision making
      • Treating customers as individuals instead of segments
      • How to design processes (both business and social) to employ data?
  • 34. Data-driven decision making
  • 35. Thank you!
      • Always eager to talk about this stuff, feel free to contact!
      • Now it's time for lots of questions!
      • @nikovuokko