We are at the end of the growth curve... 9B is our total population... This is an important observation because many data estimates are based on human activity and have so far assumed exponential growth... This is NOT the case anymore!
This shows the development of hard drive capacity over time.
The calculation is not meant to be read; it just lets people know we did the calc and what it PHYSICALLY means (see the animation)... There is a real cost to storing a lot of data, and this is one of the reasons cloud makes a lot of sense.
Wine bottles
This is Hyde Park... From one end to the other...
Big Data vs Data Warehousing
Big Data vs. Data Warehousing
Synergy or Conflict?
Thomas Kejser
firstname.lastname@example.org
http://blog.kejser.org
@thomaskejser
Who is this Guy?
Thomas Kejser
http://blog.kejser.org
@thomaskejser
• Formerly: Lead SQLCAT EMEA
• Now: CTO FusionIo EMEA
• 15 years of database experience
• Performance Tuner
Human Consciousness Doesn’t Scale
[Chart: world population in billions of humans (y-axis, 5 to 10) by year (x-axis, 2000 to 2250). Source: United Nations Projections]
Text Messages in a Table
CREATE TABLE AllTexts (
    Sender           BIGINT        -- 8B
  , Receiver         BIGINT        -- 8B
  , SenderLocation   BIGINT        -- 8B
  , ReceiverLocation BIGINT        -- 8B
  , Time             DATETIME      -- 8B
  , SMS              VARCHAR(140)  -- 140B
)
= 180 bytes per row
How much do we text?
• World Average
  • 6.1 Trillion Text Messages / year
  • About 80% cell phone coverage
  • 7 billion people
  • 3 messages/day/person
• But:
  • Teenagers: 50 messages/day
Source: Pew Internet Research 2010 & ITU
How much will we EVER text?
• 9B people acting like teenagers (in 2050)
  • 50 texts/day
• That’s 450 billion texts/day
  • 164 Trillion texts/year (about 27x today’s 6.1 trillion)
  • 180 bytes each
  • Assume x3 compression
• Approximation: 10 Petabytes/year in 2050
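The arithmetic behind this estimate can be reproduced with a short sketch. All inputs come from the slides; the constant names are illustrative, and treating a petabyte as 10^15 bytes is an assumption.

```python
# Back-of-envelope check of the 2050 texting estimate.
# Inputs are taken from the slides; 1 PB = 10**15 bytes is an assumption.
PEOPLE = 9_000_000_000
TEXTS_PER_PERSON_PER_DAY = 50
BYTES_PER_TEXT = 180            # row size of the AllTexts table
COMPRESSION_RATIO = 3
DAYS_PER_YEAR = 365
TEXTS_PER_YEAR_TODAY = 6.1e12   # world total quoted for today

texts_per_day = PEOPLE * TEXTS_PER_PERSON_PER_DAY
texts_per_year = texts_per_day * DAYS_PER_YEAR
raw_bytes_per_year = texts_per_year * BYTES_PER_TEXT
compressed_pb_per_year = raw_bytes_per_year / COMPRESSION_RATIO / 10**15
growth_vs_today = texts_per_year / TEXTS_PER_YEAR_TODAY

print(f"{texts_per_day:,} texts/day")
print(f"{texts_per_year:,} texts/year")
print(f"{compressed_pb_per_year:.2f} PB/year after compression")
print(f"{growth_vs_today:.1f}x today's volume")
```

The compressed total lands just under 10 PB per year, which is where the rounded figure on the slide comes from.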
Moore’s Hard Drives
[Chart: hard drive capacity in GB (log scale) by year. Annotation: Can it be done?]
How Large is this/year?
Hard Disk (4TB): 2.5"
Wine Bottle (75cl): 4.0"
About 1500 Wine Bottles
In the Data Center
• Calculating:
  • 2U Storage = 24 Disks (includes compute)
  • 4TB per Disk
  • 100TB in 2U (a bit less)
  • 10PB = 200U storage
• About six racks
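A minimal sketch of the same sizing exercise follows. The disk size, disks per 2U unit and the 10PB target come from the slides; the usable rack height of 36U is an assumption made here (the slides do not state a rack size), and the calculation covers raw capacity only, with no redundancy.

```python
import math

TARGET_TB = 10_000          # 10 PB per year in decimal TB (from the slides)
DISK_TB = 4                 # capacity per disk (from the slides)
DISKS_PER_ENCLOSURE = 24    # one 2U storage unit (from the slides)
ENCLOSURE_HEIGHT_U = 2
USABLE_U_PER_RACK = 36      # ASSUMPTION: not stated in the slides

# Raw capacity only: no RAID, replication or spare disks are modelled.
disks = math.ceil(TARGET_TB / DISK_TB)
enclosures = math.ceil(disks / DISKS_PER_ENCLOSURE)
rack_units = enclosures * ENCLOSURE_HEIGHT_U
racks = math.ceil(rack_units / USABLE_U_PER_RACK)

print(f"{disks} disks in {enclosures} enclosures")
print(f"{rack_units}U of storage, about {racks} racks")
```

Because a 2U unit holds 96TB rather than a round 100TB, the rack-unit count comes out slightly above 200U. The rack count depends heavily on the assumed usable height per rack.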
PDW vs. Hive – Scan/seek
Query 1:
SELECT count(*)
FROM lineitem
Query 2:
SELECT max(l_quantity)
FROM lineitem
WHERE l_orderkey > 1000 and l_orderkey < 100000
GROUP BY l_linestatus
[Bar chart: run time in seconds (axis 0 to 1500) for Hive and PDW on Query 1 and Query 2]
PDW vs. Hive - Joins
SELECT max(l_orderkey)
FROM orders
JOIN lineitem
ON l_orderkey = o_orderkey
PDW-U:
• orders partitioned on c_custkey
• lineitem partitioned on l_partkey
PDW-P:
• orders partitioned on o_orderkey
• lineitem partitioned on l_orderkey
[Bar chart: run time in seconds (axis 0 to 4000) for Hive, PDW-U and PDW-P]
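Why the partitioning key matters for this join can be illustrated with a toy model. This is a sketch under stated assumptions, not how PDW or Hive actually plan joins: the four-node cluster, the modulo placement function, the synthetic rows and the `o_custkey` column name are all hypothetical. It counts how many rows sit on a different node than the join key would require.

```python
NODES = 4  # hypothetical cluster size


def node_for(key: int) -> int:
    """Place a row on a node by simple modulo hashing (illustrative only)."""
    return key % NODES


def rows_to_move(rows, partition_col, join_col):
    """Count rows stored on a different node than the join key maps to."""
    return sum(
        1 for row in rows
        if node_for(row[partition_col]) != node_for(row[join_col])
    )


# Synthetic data; column names follow the style used on the slide.
orders = [{"o_orderkey": k, "o_custkey": (k * 7) % 13 + 1}
          for k in range(1, 101)]
lineitem = [{"l_orderkey": k, "l_partkey": (k * 3 + i) % 17 + 1}
            for k in range(1, 101) for i in range(3)]

# Unaligned layout: tables partitioned on columns other than the join key.
unaligned = (rows_to_move(orders, "o_custkey", "o_orderkey")
             + rows_to_move(lineitem, "l_partkey", "l_orderkey"))

# Aligned layout: both tables partitioned on the join key.
aligned = (rows_to_move(orders, "o_orderkey", "o_orderkey")
           + rows_to_move(lineitem, "l_orderkey", "l_orderkey"))

print(f"rows to move, unaligned layout: {unaligned}")
print(f"rows to move, aligned layout:   {aligned}")
```

When both tables are partitioned on the join key, matching rows already share a node and nothing has to cross the network before the join can run locally.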
What does Big Data need to Catch up?
• Thread startup times
• Co-location awareness
• Files vs. optimized DB memory structures
• Column stores and other DB tech
Generic is good… but when there is structure, make use of it!
Saturday, 1:39am - at The Pub
Your Semi-structured Data, For Free
Big Value
Extraction of meaning and insight from semi-structured data
Extracting Meaning from Humans
Method: Examples
• Turn semi-structure to structure: Image recognition, network proximity and super nodes, social media
• Needle in a haystack: Extract outliers, Fraud
• Herd behaviors: Clustering, Pattern Recognition, “Customers who bought this also bought”
• Text classification and search: Text indexes, syntactic counting, pagerank
• Text to structure: Semantic analysis, loose structure into structure
Find New Customers
“Michael, who is respected among his peers, often talks about his new, cool gadgets”
[Diagram labels: Tommy, Thomas, Michael]
Cross Sell
“Families who own an Aston Martin will often buy a Mini Cooper too”
Summary
Data Warehouse                      Big Data
• There is a model                  • Don’t bother modeling!
• Seek Co-location                  • Optional Co-Location
• Respond in seconds                • Respond in minutes
• Calculate first, query after      • Calculate while querying
• Expensive HW                      • Cheap HW
• Optimise for target HW            • Good enough on all HW
• Homogeneous HW                    • Heterogeneous HW
• Pay vendor, expect optimised      • Free license, optimise yourself