Big Data vs Data Warehousing

An attempt to fi

Published in: Technology

  1. Bigdata vs. Data Warehousing: Synergy or Conflict? Thomas Kejser, thomas@kejser.org, http://blog.kejser.org, @thomaskejser
  2. Who is this Guy? Thomas Kejser, http://blog.kejser.org, @thomaskejser
     • Formerly: Lead SQLCAT EMEA
     • Now: CTO Fusion-io EMEA
     • 15 years of database experience
     • Performance Tuner
  3. Human Consciousness Doesn’t Scale [Chart: billions of humans (axis 5 to 10) by year, 2000 to 2250] Source: United Nations Projections
  4. Text Messages in a Table
     CREATE TABLE AllTexts (
         Sender BIGINT               -- 8B
       , Receiver BIGINT             -- 8B
       , SenderLocation BIGINT       -- 8B
       , ReceiverLocation BIGINT     -- 8B
       , Time DATETIME               -- 8B
       , SMS VARCHAR(140)            -- 140B
     )
     = 180 Bytes
  5. How much do we text?
     • World Average
       • 6.1 Trillion Text Messages / year
       • About 80% cell phone coverage
       • 7 billion people
       • 3 messages/day/person
     • But:
       • Teenagers: 50 messages/day
     Source: Pew Internet Research 2010 & ITU
  6. How much will we EVER text?
     • 9B people acting like teenagers (in 2050)
       • 50 texts/day
     • That’s 450 billion texts/day
       • 164 Trillion texts/year (20x today)
       • 180 bytes each
       • Assume x3 compression
     • Approximation: 10 Petabytes/year in 2050
  7. Moore’s Hard Drives: Can it be done? [Chart: log of capacity in GB by year]
  8. How Large is this/year? Hard Disk (4TB): 2.5”. Wine Bottle (75cl): 4.0”. About 1500 Wine Bottles
  9. In the Data Center
     • Calculating:
       • 2U Storage = 24 Disks (includes compute)
       • 4TB per Disk
       • 100TB in 2U (a bit less)
       • 10PB = 200U storage
     • About six racks
  10. Warehouses Serve us Well…
  11. … And it is Becoming a Commodity
      • Good Management Interfaces
      • Standard SQL
        • with a few extensions
      • Appliances
      • Support system
      • Homogeneous HW
        • In chunks
  12. vs.
  13. PDW vs. Hive – Scan/seek
      Query 1: SELECT count(*) FROM lineitem
      Query 2: SELECT max(l_quantity) FROM lineitem WHERE l_orderkey > 1000 and l_orderkey < 100000 GROUP BY l_linestatus
      [Bar chart: seconds (axis 0 to 1500) for Hive and PDW on Query 1 and Query 2]
  14. PDW vs. Hive - Joins
      SELECT max(l_orderkey) FROM orders JOIN lineitem ON l_orderkey = o_orderkey
      PDW-U:
      • orders partitioned on o_custkey
      • lineitem partitioned on l_partkey
      PDW-P:
      • orders partitioned on o_orderkey
      • lineitem partitioned on l_orderkey
      [Bar chart: seconds (axis 0 to 4000) for Hive, PDW-U and PDW-P]
  15. What does Big Data need to Catch up?
      • Thread startup times
      • Co-location awareness
      • Files vs. optimized DB memory structures
      • Column stores and other DB tech
      Generic is good… but when there is structure, make use of it!
  16. What is Bigdata? Very Unstructured Data
  17. How many Pictures of Cats?
      • Flickr Today:
        • 300MB/month
        • 2GB/year
        • 51M users (too small?)
      • Estimate: 102 PB / year
      • 10 x text messages
      Source: Wikipedia
  18. How big is this in wine bottles?
  19. We have learned how to store it!
  20. What is HDFS?
      • Distributed File System
      • Open Source
      • No more SAN
      • The Failure Unit is the Server
  21. Fully unstructured data is boring… Unless you get money for storing it
  22. Acquiring Personal Information: Your Semi-structured Data, the Old Fashioned Way
  23. The Social Angle: Who do you talk to and how often?
  24. The Reasons: Why do you own a cell phone?
  25. Saturday, 1:39am - at The Pub: Your Semi-structured Data, For Free
  26. Big Value: Extraction of meaning and insight from semi-structured data
  27. Extracting Meaning from Humans
      Method | Examples
      Turn semi-structure to structure | Image recognition, network proximity and super nodes, social media
      Needle in a haystack | Extract outliers, Fraud
      Herd behaviors | Clustering, Pattern Recognition, “Customers who bought this also bought”
      Text classification and search | Text indexes, syntactic counting, pagerank
      Text to structure | Semantic analysis, loose structure into structure
  28. Find New Customers: “Michael, who is respected among his peers, often talks about his new, cool gadgets” [Diagram labels: Tommy, Thomas, Michael]
  29. Cross Sell: “Families who own an Aston Martin will often buy a Mini Cooper too”
  30. Free Information
  31. Need: Lots of CPU Cores!
  32. Need: Data Centers!
  33. Provisioning has to be REALLY fast
  34. Things to Learn for the Future
      • Get good at
        • Statistics (again)
        • Distributed Algorithms
        • Tuning
      • Understand Physical Constraints
      • Acquire deep domain knowledge
  35. Something is Changing. Today: CAPEX Hardware. Tomorrow: OPEX Hardware. [Diagram label: You]
  36. The Mother of All Stovepipes
  37. Big Data / Staging (No Model): Data you are afraid to lose. Delivery (Model): Data you actually need.
  38. Synergy [Diagram labels: “Create Structure for me”, Warehouse, “Here is a table”]
  39. Applying Social Media to Structure
  40. Summary
      Data Warehouse | Big Data
      • There is a model | Don’t bother modeling!
      • Seek Co-location | Optional Co-Location
      • Respond in seconds | Respond in minutes
      • Calculate first, query after | Calculate while querying
      • Expensive HW | Cheap HW
      • Optimise for target HW | Good enough on all HW
      • Homogeneous HW | Heterogeneous HW
      • Pay vendor, expect optimised | Free license, optimise yourself
  41. &
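
The back-of-envelope sizing on slides 4, 6, 9 and 17 can be re-run as a short Python sketch. It is not part of the deck: the function names are illustrative, and it assumes decimal units (1 PB = 10^15 bytes) and a 365-day year, neither of which the slides state.

```python
import math

# Row width from the AllTexts table on slide 4:
# five 8-byte columns plus one VARCHAR(140).
ROW_BYTES = 5 * 8 + 140

TB = 10**12  # assumption: decimal terabyte
PB = 10**15  # assumption: decimal petabyte


def texts_per_year(people, texts_per_day):
    """Total messages per year, assuming a 365-day year."""
    return people * texts_per_day * 365


def storage_bytes_per_year(texts, row_bytes=ROW_BYTES, compression=3):
    """Bytes needed per year after the x3 compression assumed on slide 6."""
    return texts * row_bytes / compression


def rack_units(total_bytes, disks_per_2u=24, disk_bytes=4 * TB):
    """Rack units needed when each 2U chassis holds disks_per_2u disks."""
    chassis = math.ceil(total_bytes / (disks_per_2u * disk_bytes))
    return 2 * chassis


# Slide 6: 9 billion people texting like teenagers (50 texts/day).
texts_2050 = texts_per_year(people=9_000_000_000, texts_per_day=50)
bytes_2050 = storage_bytes_per_year(texts_2050)

# Slide 17: 2 GB/year per Flickr user, 51 million users.
flickr_bytes = 2 * 10**9 * 51 * 10**6

print(f"row size: {ROW_BYTES} bytes")
print(f"texts/year in 2050: {texts_2050 / 1e12:.0f} trillion")
print(f"text storage: {bytes_2050 / PB:.1f} PB/year")
print(f"rack units: {rack_units(bytes_2050)}U")
print(f"Flickr estimate: {flickr_bytes / PB:.0f} PB/year")
```

With unrounded inputs the sketch gives about 9.9 PB/year and 206U, which agrees with the slides' rounded figures of 10 PB and 200U; the Flickr estimate comes out at the 102 PB/year quoted on slide 17.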
