Tuesday, June 8, 2010
BIG DATA
                                       The rise of the data scientist




                        http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/
Holidaycheck
                  Travel platform: review + book

                  12+ countries (.de ... .cn)

                  30% growth / year, profitable

                  Almost 1.5 million hotel reviews

                  1.6 million+ pics


Data @ HC
                  internet-driven company

                  traditional: MVC / 3-Tier / RDBMS / caching

                  50+ Apache instances

                  15 GB operational data

                  12 GB logs / day

                  5 searches / second


                        My scientist friend: “That’s neat, but it’s not data science.”


The I/O Bottleneck
                   “The problem is simple: Memory, Disk size and CPU and even
                 network performance continue to grow much faster than disk I/O
                                          performance.”
                  2004 to 2009:

                  CPU: still following Moore's Law (transistor count x2 every 18 months)

                  Memory bandwidth (Intel): 9.3x

                  Disk density (SATA): 8x

                  Disk I/O: 0.8x

                  Network speed: routers can easily saturate the fastest hard drives


                        http://blogs.cisco.com/datacenter/comments/networking_delivering_more_by_exceeding_the_law_of_moore/
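
A back-of-envelope reading of the growth factors on this slide, as a minimal sketch. It assumes that disk density is a fair proxy for capacity per drive and that "Disk I/O" means sustained throughput; the slide defines neither, so both are assumptions. Under them, the time to read an entire drive scales as capacity divided by throughput, which works out to roughly 10x longer in 2009 than in 2004.

```python
# Growth factors 2004 -> 2009 as quoted on the slide.
GROWTH = {
    "memory_bandwidth": 9.3,
    "disk_density": 8.0,
    "disk_io": 0.8,
}


def relative_to_disk_io(factor_name: str) -> float:
    """How much faster a resource grew than disk I/O over the same period."""
    return GROWTH[factor_name] / GROWTH["disk_io"]


def full_scan_slowdown() -> float:
    """Assumption: scan time = capacity / throughput, with density as the capacity proxy."""
    return relative_to_disk_io("disk_density")


if __name__ == "__main__":
    print(f"Full-drive scan takes about {full_scan_slowdown():.1f}x longer")
    print(f"Memory bandwidth outgrew disk I/O by about "
          f"{relative_to_disk_io('memory_bandwidth'):.1f}x")
```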




I/O Repercussions

                  Turn to memcache

                  Try out SSD

                  Try out asynchronous writes (e.g. message queues)

                  Try to solve/hack the I/O problem: Sharding, in-memory DB

                  Our problems seem big, but are they really?
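
The "asynchronous writes" item can be sketched with nothing but the Python standard library. This is an in-process stand-in for a message queue, not how Holidaycheck did it: `AsyncWriter` and the `sink` callable are hypothetical names introduced for illustration. The caller enqueues a record and returns at once, while a single worker thread performs the slow write.

```python
import queue
import threading


class AsyncWriter:
    """Write-behind buffer: callers enqueue and return; a worker does the slow write."""

    def __init__(self, sink):
        self._sink = sink              # any callable that performs the slow write
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def write(self, record):
        """Return immediately; the record is persisted later by the worker."""
        self._queue.put(record)

    def flush(self):
        """Block until everything queued so far has been handed to the sink."""
        self._queue.join()

    def _drain(self):
        while True:
            record = self._queue.get()
            try:
                self._sink(record)
            finally:
                self._queue.task_done()


if __name__ == "__main__":
    stored = []                        # stand-in for a database table or log file
    writer = AsyncWriter(stored.append)
    for i in range(3):
        writer.write({"search_id": i})
    writer.flush()
    print(stored)
```

The price of the faster request path is that anything still sitting in the queue is lost if the process dies before the worker gets to it.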



So what is Big Data anyway?
           “The term Big data from software engineering and computer science
         describes datasets that grow so large that they become awkward to work
                     with using on-hand database management tools”




                        kilo to mega to giga to tera to peta to exa to zetta to yotta
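
To make the prefix ladder concrete, here is a small sketch that maps each prefix to its power of ten and renders a byte count with the largest fitting prefix. It assumes decimal (SI) prefixes, each a factor of 1000; the function names are illustrative only.

```python
# SI prefixes from the slide, each step a factor of 1000.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def prefix_exponent(prefix: str) -> int:
    """Power of ten for a decimal prefix: kilo -> 3, ..., yotta -> 24."""
    return 3 * (PREFIXES.index(prefix) + 1)


def humanize(num_bytes: float) -> str:
    """Render a byte count with the largest prefix that keeps the value >= 1."""
    value, label = float(num_bytes), ""
    for prefix in PREFIXES:
        if value < 1000:
            break
        value /= 1000
        label = prefix
    return f"{value:.1f} {label}bytes"


if __name__ == "__main__":
    for prefix in PREFIXES:
        print(f"{prefix}: 10^{prefix_exponent(prefix)}")
    print(humanize(12e9))
```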

NoSQL = Not Only SQL
                  Trade-offs, e.g. transactions, data loss

                  e.g. Document Stores (MongoDB)

                  e.g. Key-Value Stores (MemcacheDB)

                  e.g. Graph Databases (Neo4j)

                  Map/Reduce algorithm
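
The Map/Reduce item can be illustrated with a single-process sketch of the three phases: map, shuffle, reduce. Real frameworks spread these phases over many machines; this version only shows the shape of the computation. The two-field log format (`path status`) is hypothetical and chosen for brevity.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (status_code, 1) for one log line; skip malformed lines."""
    fields = line.split()
    if len(fields) >= 2:
        yield fields[1], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(lines):
    """Run map, shuffle and reduce in sequence and return {key: total}."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    log = ["/hotel/1 200", "/hotel/2 404", "/search 200"]
    print(map_reduce(log))
```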




Medium Data
         “With yesterday's scientific technology most businesses should be able to
                             handle their data analysis needs.”


                            HC: 12 GB logfiles / day = medium data problem


                                           Solved (?) with: RDBMS + NoSQL

                        (2006) Bigtable: A Distributed Storage System for Structured Data, Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber

                        (2004) MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
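
The "medium data" label can be sanity-checked with simple arithmetic. This sketch assumes the slide's log figure means gigabytes and uses decimal units; the 5 searches per second comes from the "Data @ HC" slide. The result is a log volume in the low single-digit terabytes per year.

```python
LOG_GB_PER_DAY = 12        # from the slide, read as gigabytes (assumption)
SEARCHES_PER_SECOND = 5    # from the "Data @ HC" slide


def yearly_log_volume_tb(gb_per_day: float = LOG_GB_PER_DAY) -> float:
    """Decimal terabytes of logs accumulated over a 365-day year."""
    return gb_per_day * 365 / 1000


def searches_per_day(per_second: float = SEARCHES_PER_SECOND) -> int:
    """Number of searches in 24 hours at a constant rate."""
    return int(per_second * 24 * 60 * 60)


if __name__ == "__main__":
    print(f"{yearly_log_volume_tb():.2f} TB of logs per year")
    print(f"{searches_per_day()} searches per day")
```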




3 sexy skills of data geeks

                        “The sexy job in the next ten years will be statisticians… The ability
                        to take data—to be able to understand it, to process it, to extract
                        value from it, to visualize it, to communicate it.” Hal Varian (Google)




                                                        http://dataspora.com/blog/sexy-data-geeks/


3 skills: statistics

                  sentiment analysis

                  machine learning

                  natural language processing

                  recommendation engines

                  good old-fashioned regression
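
As a reminder of how little machinery "good old-fashioned regression" needs, here is a minimal ordinary-least-squares sketch for a single predictor, written against the standard library only. The function name and the sample numbers are illustrative, not taken from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Made-up observations that lie exactly on y = 2x + 1.
    slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
    print(slope, intercept)
```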




3 skills: visualization
                              Q: Are you hiring statisticians, visualization experts & data plumbers?




                        The Oatmeal vs. Edward Tufte, Ben Fry

3 skills: data plumbing

           Glue languages: Python, Perl, regex, XSLT

                                                 Admin: setting up, maintaining clusters

                             Affinity with OSS & *nix

                                                NoSQL = NoSchema = Transform Data


                        /^([\w!#$%&'*+\-\/=?^`{|}~]+\.)*[\w!#$%&'*+\-\/=?^`{|}~]+@((((([a-z0-9]{1}[a-z0-9\-]{0,62}[a-z0-9]{1})|[a-z])\.)+[a-z]{2,6})|(\d{1,3}\.){3}\d{1,3}(:\d{1,5})?)$/i
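
A typical data-plumbing use of the pattern on this slide is filtering a messy column before loading it elsewhere. The sketch below ports the pattern to Python's `re` module, with the backslash escapes written out; `clean_emails` and the sample values are hypothetical. It normalises whitespace and case, then keeps only values the pattern accepts.

```python
import re

# The slide's pattern, split over three lines for readability.
EMAIL_RE = re.compile(
    r"^([\w!#$%&'*+\-/=?^`{|}~]+\.)*[\w!#$%&'*+\-/=?^`{|}~]+@"
    r"((((([a-z0-9]{1}[a-z0-9\-]{0,62}[a-z0-9]{1})|[a-z])\.)+[a-z]{2,6})"
    r"|(\d{1,3}\.){3}\d{1,3}(:\d{1,5})?)$",
    re.IGNORECASE,
)


def clean_emails(raw_values):
    """Strip and lowercase each value; keep only those matching EMAIL_RE."""
    cleaned = []
    for value in raw_values:
        candidate = value.strip().lower()
        if EMAIL_RE.match(candidate):
            cleaned.append(candidate)
    return cleaned


if __name__ == "__main__":
    print(clean_emails([" Alice@Example.com ", "not an email"]))
```

Note that the pattern limits the top-level domain to 2 to 6 letters, so it rejects addresses under longer TLDs.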



More Data beats smart algorithms




                                       face recognition

                         spelling correction      machine translation


                             http://videos.syntience.com/ai-meetups/peternorvig.html
                               http://dataspora.com/blog/tipping-points-and-big-data/
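
Spelling correction shows the "more data" argument well, because the algorithm can stay trivial while the word counts do the work. The following is a simplified sketch in the spirit of a frequency-based corrector, not a reproduction of any published code: it considers only candidates one edit away and picks the one seen most often in the corpus. The tiny corpus is made up.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return word if known, else the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word                    # nothing better found; leave unchanged
    return max(candidates, key=counts.__getitem__)


if __name__ == "__main__":
    corpus = "the hotel was great the hotel pool was small"
    counts = Counter(corpus.split())
    print(correct("hotl", counts), correct("teh", counts))
```

A larger corpus improves the corrector without any change to the code: more words become known and the frequency ranking gets sharper.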

Ethics of data

                  Black Hat vs. White Hat <=> Black Data vs. White data

                  White: Amazon free public datasets (e.g. human genome)

                  Black: Scientific climate data (or the lack of PUBLIC data)

                  Just like money, information flows to the least taxed location in a
                  global world.



Take-Away & Discuss
                  “Don't throw away data if you don’t have to, because unlike material
                  goods, data becomes more valuable the more of it is created. As a
                  society, I don't think we understand this completely yet.”

                  q: Who is using a NoSQL db? Share stories?
                  q: Do you know how much data you are throwing away?
                  q: Do you hire statisticians?
                  q: Do you hire visualization experts?
                  q: Any tips on introducing NoSQL in companies?
                  q: Share: how big is your data?
                  q: Do you own your customer data or does Facebook?
                  q: Do you own your analytics data?
                  q: How are you exploiting asynchronicity?
                  q: Do you own your content or does Google?
                  q: Should information be regulated (privacy)? Can it?


Big Data @ Bodensee Barcamp 2010


Editor's Notes

  1. Does a 500 GB stick exist? Yes, this is a quiz. Internet is allowed. No cheating, no SSD drives.
  2. No, it doesn’t. Chinese fake. A bit better than this one. When do you think a 1 TB USB stick will exist? Petabyte? We mostly believe in Moore’s law & that’s a problem.
  3. Big Data: what is it? Set up the systems. Data scientists: who are they? Hire the people. Discuss!
  4. growing pains
  5. The web is full of "data-driven apps." We are one. But that does not make us “data scientists”. Storage & analysis are separate things: operational vs. analysis datastore.
  6. When designing systems, these days you run more and more into I/O bottlenecks.
  7. NoSQL: document-stores, “Turn in your schema at the entrance”, trade-offs, MongoDB, Cassandra, NoSQL = Not ONLY SQL. Clickpaths. Question: describe data sizes in audience.
  8. Used to be: Big Oil. Big Telco. Big Banking. Big Pharma. BIG in Physics: LHC outputs 24 zettabytes / second. BIG in Genetics: several terabytes per sequencing experiment. Personal genome / Personalized medicine / less than 10 years ago human genome, now 1000 genomes project, SNPs (23andme) 10 & 24 zeroes Illumina sequencer /
  9. yesterday = BigTable, MapReduce, Clustering, approx. 5 years old. Let's face it: most businesses do not have the data needs ... Exceptions: Google / Facebook / Twitter. Take away: can you handle medium-data? What tech can be used? What kind of systems can I build? NoSQL.
  10. The human factor: who do I hire? http://radar.oreilly.com/2010/06/what-is-data-science.html http://dataspora.com/blog/sexy-data-geeks/ Do you have a statistician on board? Do you have a data visualization expert on board? Do you have a data plumber on board?
  11. When all of the above fails: crowdsourcing? MTurk
  12. Edward Tufte, Ben Fry. Do you have a statistician on board? Do you have a data visualization expert on board? Do you have a data plumber on board?
  13. Peter Norvig: spelling corrector, machine translation, image recognition. Phase shifts: dig out data that you thought didn’t exist: GayDar, Netflix
  14. Project Gaydar: do you own yourself? Netflix competition: shredding. Google trading floor: buy more Google stock! # Grey data. 23andMe:
  15. Is that your data, or are you just happy to see me? How big is your data? (Share) Who is using a NoSQL db? Share? Do you have statisticians? Visual experts? Data plumbe