Tuesday, June 8, 2010
Big Data @ Bodensee Barcamp 2010

Published in: Technology, Education

  • Does a 500 GB USB stick exist? Yes, this is a quiz. Internet is allowed; no cheating, no SSD drives.
  • No, it doesn’t. It is a Chinese fake. A bit better than this one. When do you think a 1 TB USB stick will exist? A petabyte one? We mostly believe in Moore’s law & that’s a problem.
  • Big Data: what is it? Setup the systems. Data scientists: who are they? Hire the people. Discuss!
  • growing pains
  • The web is full of "data-driven apps." We are one. But that does not make us “data scientists”. Storage & analysis are separate things: operational vs. analysis datastore.
  • When designing systems, these days you run more and more into I/O bottlenecks.
  • NoSQL: document stores, “Turn in your schema at the entrance”, trade-offs, MongoDB, Cassandra. NoSQL = Not ONLY SQL. Clickpaths. Question: describe data sizes in audience.
  • Used to be: Big Oil. Big Telco. Big Banking. Big Pharma. BIG in physics: LHC outputs 24 zettabytes / second. BIG in genetics: several terabytes per sequencing experiment. Personal genome / personalized medicine: less than 10 years ago the human genome, now the 1000 Genomes Project, SNPs (23andMe). 10 & 24 zeroes. Illumina sequencer.
  • Yesterday = BigTable, MapReduce, clustering: approx. 5 years old. Let's face it: most businesses do not have the data needs ... Exceptions: Google / Facebook / Twitter. Take-away: can you handle medium data? What tech can be used? What kind of systems can I build? NoSQL.
  • The human factor: who do I hire? http://radar.oreilly.com/2010/06/what-is-data-science.html http://dataspora.com/blog/sexy-data-geeks/ Do you have a statistician on board? Do you have a data visualization expert on board? Do you have a data plumber on board?
  • When all of the above fails: crowdsourcing? MTurk
  • Edward Tufte, Ben Fry. Do you have a statistician on board? Do you have a data visualization expert on board? Do you have a data plumber on board?
  • Peter Norvig: spelling corrector, machine translation, image recognition. Phase shifts: dig out data that you thought didn’t exist: Gaydar, Netflix.
  • Project Gaydar: do you own yourself? Netflix competition: shredding. Google trading floor: buy more Google stock! Grey data: 23andMe.
  • Is that your data, or are you just happy to see me? How big is your data? (Share.) Who is using a NoSQL db? Share? Do you have statisticians? Visual experts? Data plumbers?
  • Big Data @ Bodensee Barcamp 2010

    1. Tuesday, June 8, 2010
    2. Tuesday, June 8, 2010
    3. BIG DATA: The rise of the data scientist. http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/
    4. Holidaycheck. Travel platform: review + book. 12+ countries (.de ... .cn). 30% growth / year, profitable. Almost 1.5 million hotel reviews. 1.6 million+ pictures.
    5. Data @ HC. Internet-driven company. Traditional: MVC / 3-tier / RDBMS / caching. 15 GB operational data. 12 GB logs / day. 5 searches / second. 50+ Apache instances. My scientist friend: “That’s neat, but it’s not data science.”
    6. The I/O Bottleneck. “The problem is simple: Memory, Disk size and CPU and even network performance continue to grow much faster than disk I/O performance.” 2004 to 2009: CPU: still following Moore's Law (transistors x2 every 18 months); memory bandwidth (Intel): 9.3x; disk density (SATA): 8x; disk I/O: 0.8x; network speed: routers can easily saturate the fastest hard drives. http://blogs.cisco.com/datacenter/comments/networking_delivering_more_by_exceeding_the_law_of_moore/
    7. I/O Repercussions. Turn to memcache. Try out SSD. Try out asynchronous writes (e.g. message queues). Try to solve/hack the I/O problem: sharding, in-memory DB. Our problems seem big, but are they really?
    8. So what is Big Data anyway? “The term Big data from software engineering and computer science describes datasets that grow so large that they become awkward to work with using on-hand database management tools.” Kilo to mega to giga to tera to peta to exa to zetta to yotta.
    9. NoSQL = Not Only SQL. Trade-offs, e.g. transactions, data loss. E.g. document stores (MongoDB), key-value stores (MemcacheDB), graph databases (Neo4j). Map/Reduce algorithm.
    10. Medium Data. “With yesterday's scientific technology most businesses should be able to handle their data analysis needs.” HC: 12 GB logfiles / day = medium data problem. Solved (?) with: RDBMS + NoSQL. References: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, “Bigtable: A Distributed Storage System for Structured Data” (2006); Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters” (2004).
    11. 3 sexy skills of data geeks. “The sexy job in the next ten years will be statisticians… The ability to take data: to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it.” Hal Varian (Google). http://dataspora.com/blog/sexy-data-geeks/
    12. 3 skills: statistics. Sentiment analysis, machine learning, natural language processing, recommendation engines, good old-fashioned regression.
    13. 3 skills: visualization. Q: Are you hiring statisticians, visualization experts & data plumbers? Vs. The Oatmeal. Edward Tufte, Ben Fry.
    14. 3 skills: data plumbing. Glue languages: Python, Perl, regex, XSLT. Admin: setting up, maintaining clusters. Affinity with OSS & *nix. NoSQL = NoSchema = Transform Data. /^([\w!#$%&'*+\-\/=?^`{|}~]+\.)*[\w!#$%&'*+\-\/=?^`{|}~]+@((((([a-z0-9]{1}[a-z0-9-]{0,62}[a-z0-9]{1})|[a-z])\.)+[a-z]{2,6})|(\d{1,3}\.){3}\d{1,3}(:\d{1,5})?)$/i
    15. More Data beats smart algorithms. Face recognition, spelling correction, machine translation. http://videos.syntience.com/ai-meetups/peternorvig.html http://dataspora.com/blog/tipping-points-and-big-data/
    16. Ethics of data. Black Hat vs. White Hat <=> Black Data vs. White Data. White: Amazon free public datasets (e.g. human genome). Black: scientific climate data (or the lack of PUBLIC data). Just like money, information flows to the least taxed location in a global world.
    17. Take-Away & Discuss. “Don't throw away data if you don’t have to, because unlike material goods, data becomes more valuable the more of it is created. As a society, I don't think we understand this completely yet.” q: Who is using a NoSQL db? Share stories? q: Do you hire statisticians? q: Do you hire visualization experts? q: Share: how big is your data? q: Do you own your analytics data? q: How are you exploiting asynchronicity? q: Do you know how much data you are throwing away? q: Any tips on introducing NoSQL in companies? q: Do you own your customer data or does Facebook? q: Do you own your content or does Google? q: Should information be regulated (privacy)? Can it?
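
Slides 9 and 10 name Map/Reduce as one way to approach a medium-data problem such as a day's worth of logfiles. A minimal single-process sketch in Python of the map, shuffle and reduce phases might look like the following. The log format, the sample lines and all function names are illustrative assumptions, not Holidaycheck's actual setup, and a real deployment would spread the phases across a cluster.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical log lines in a made-up "<ip> <path> <status>" format;
# the deck does not show the real log layout.
LOG_LINES = [
    "10.0.0.1 /hotel/123 200",
    "10.0.0.2 /hotel/123 200",
    "10.0.0.1 /search 500",
    "10.0.0.3 /hotel/456 200",
]


def map_phase(line):
    """Emit one (path, 1) pair per log line."""
    _ip, path, _status = line.split()
    yield path, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single hit count."""
    return key, sum(values)


def map_reduce(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(map_reduce(LOG_LINES))
```

Because map_phase looks at one line at a time and reduce_phase at one key at a time, both phases could in principle run in parallel on separate chunks of the log, which is the property the deck's NoSQL slide is pointing at.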

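
Slide 7 suggests asynchronous writes (e.g. message queues) as one response to the I/O bottleneck. The sketch below illustrates the idea with Python's standard queue and threading modules: the caller enqueues a write and moves on, while a background worker drains the queue into the store. The in-process queue stands in for a real message broker, the list stands in for a slow datastore, and every name here is hypothetical.

```python
import queue
import threading

# Sentinel object that tells the background writer to shut down.
_STOP = object()


def write_asynchronously(reviews):
    """Enqueue each review and let a background thread persist it."""
    write_queue = queue.Queue()
    stored = []  # stand-in for a slow, disk-backed datastore

    def writer():
        # Drain the queue in FIFO order until the sentinel arrives.
        while True:
            item = write_queue.get()
            if item is _STOP:
                break
            stored.append(item)

    worker = threading.Thread(target=writer)
    worker.start()

    for review in reviews:
        # put() returns immediately; the caller never waits on the store.
        write_queue.put(review)

    write_queue.put(_STOP)
    worker.join()  # only the demo waits; a web request would not
    return stored


if __name__ == "__main__":
    sample = [{"hotel_id": i, "text": "nice"} for i in range(3)]
    print(write_asynchronously(sample))
```

The trade-off matches the one listed on the NoSQL slide: the caller gets a fast response, but a write that is still in the queue when the process dies is lost.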