Southeast Linuxfest -- Is BIG DATA Like High School Sex?

Is BIG DATA something real? Why do we need it? Well, this is a skeptical view on the subject.

  • 1. Is Big Data like High School Sex - Lots of Talk but little action?
  • 2. Google Test ● BIG DATA 838,000,000
  • 3. Google Hit Test ● BIG DATA 838,000,000 ● World Peace 118,000,000
  • 4. Google Hit Test ● BIG DATA 838,000,000 ● World Peace 118,000,000 ● Cure Cancer 20,800,000
  • 5. Google Hit Test ● BIG DATA 838,000,000 ● World Peace 118,000,000 ● Cure Cancer 20,800,000 ● Kardashian 64,300,000 ● Roswell UFO 259,000 ● JFK Conspiracy 3,240,000
  • 6. Is Big Data Good?
  • 7. Wall Street Journal May 19th 2014 Big Data Banking Is Not Just for Big Banks By Seth Rosensweig, John Milani and Michael B. Flynn ✔ The key element in success for any project involving Big Data is accepting and embracing decision making with less-than-ideal information.
  • 8. What could go wrong there?
  • 9. New York Times Nov 2004 What Wal-Mart Knows About Customers' Habits By CONSTANCE L. HAYS ● HURRICANE FRANCES was on its way, barreling across the Caribbean, threatening a direct hit on Florida's Atlantic coast. Residents made for higher ground, but far away, in Bentonville, Ark., executives at Wal-Mart Stores decided that the situation offered a great opportunity for one of their newest data-driven weapons, something that the company calls predictive technology. ● A week ahead of the storm's landfall, Linda M. Dillman, Wal-Mart's chief information officer, pressed her staff to come up with forecasts based on what had happened when Hurricane Charley struck several weeks earlier. Backed by the trillions of bytes' worth of shopper history that is stored in Wal-Mart's computer network, she felt that the company could "start predicting what's going to happen, instead of waiting for it to happen," as she put it. ● The experts mined the data and found that the stores would indeed need certain products - and not just the usual flashlights. "We didn't know in the past that strawberry Pop-Tarts increase in sales, like seven times their normal sales rate, ahead of a hurricane," Ms. Dillman said in a recent interview. "And the pre-hurricane top-selling item was beer."
  • 10. AR Redneck != FL Redneck
  • 11. Define the Problem!
  • 12. ● Big data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
  • 13. What Parts do you need?
  • 14. What sort of Hardware do you need for a Hadoop Cluster? Here are the recommended specifications for DataNode/TaskTrackers in a balanced Hadoop cluster: ● 12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration ● 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz ● 64-512GB of RAM ● Bonded Gigabit Ethernet or 10Gigabit Ethernet (the more storage density, the higher the network throughput needed) select-the-right-hardware-for-your-new-hadoop-cluster/ Here are the recommended specifications for NameNode/JobTracker/Standby NameNode nodes. The drive count will fluctuate depending on the amount of redundancy: ● 4–6 1TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1 for Apache ZooKeeper, and 1 for Journal node) ● 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz ● 64-128GB of RAM ● Bonded Gigabit Ethernet or 10Gigabit Ethernet
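To put the DataNode disk ranges above into perspective, here is a rough sketch of the storage a single node provides. The disk counts and sizes come from the slide; the replication factor of 3 is an assumption based on the usual HDFS default and is not stated in the deck, and the function name is made up for illustration.

```python
# Raw and usable storage for one DataNode, using the disk ranges from the slide.
# Assumption (not from the slides): HDFS keeps 3 copies of every block.
HDFS_REPLICATION = 3


def datanode_capacity_tb(disks, tb_per_disk, replication=HDFS_REPLICATION):
    """Return (raw_tb, usable_tb) for a single DataNode."""
    raw = disks * tb_per_disk
    return raw, raw / replication


# Low end of the recommendation: 12 disks of 1 TB each.
low = datanode_capacity_tb(12, 1)
# High end of the recommendation: 24 disks of 4 TB each.
high = datanode_capacity_tb(24, 4)

print(f"smallest DataNode: {low[0]} TB raw, {low[1]:.0f} TB after replication")
print(f"largest DataNode: {high[0]} TB raw, {high[1]:.0f} TB after replication")
```

Under that replication assumption, a node holds somewhere between 4 TB and 32 TB of distinct data, which is worth comparing with the amount of data you actually have before ordering a rack.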
  • 15. reveal-hadoop-maturity-curve/ Kaushik says that the average Hadoop cluster size follows a fairly predictable curve. “Our observation is that companies typically experiment with cluster size of under 100 nodes and expand to 200 or more nodes in the production stages. Some of the advanced adopters cluster sizes are over 1,000 nodes.”
  • 16. That is over $400K in servers alone!(1) ● Then add in floor space, power, a few DevOps minions, a/c, support contracts, extra cleaning staff, and miscellaneous computer room stuff! ● (1) Yes, you will get a discount if you buy 200
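The arithmetic behind that figure can be sketched in a few lines. The 200 servers and the $400K floor are the numbers on the slides; the helper name is hypothetical, and the 1,000-node line simply reuses the same per-server floor as an illustration, not as a quoted price.

```python
# Figures from the slides: a 200-node production cluster costing "over $400K".
NODES = 200
SERVER_BUDGET_USD = 400_000


def implied_price_per_server(budget_usd, nodes):
    """Average price per server implied by a total budget."""
    return budget_usd / nodes


per_server = implied_price_per_server(SERVER_BUDGET_USD, NODES)

# Illustration only: the same per-server floor applied to the 1,000-node
# clusters that the advanced adopters are said to run.
large_cluster = per_server * 1000

print(f"at least ${per_server:,.0f} per server")
print(f"at least ${large_cluster:,.0f} for 1,000 servers")
```

Since the slide says "over" $400K, both results are lower bounds, and they cover servers only.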
  • 17. DataInformed April 2014 What it Takes to Succeed with Big Data by Thomas H. Davenport ● Jeff Bezos of Amazon is known for saying, “We never throw away data,” simply because it is difficult to know when it may become important for a product or service offering down the road.
  • 18. ● Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers". What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."
  • 19. Digital Landfill ● No, you cannot keep all the data ✔ PCI ✔ IRS ✔ HIPAA ✔ ?
  • 20. The Vendors
  • 21. IBM @ Hadoop World '13 ● Lots of new vendors ● Market will shake out 75% in two years ● Therefore buy IBM as they are an old company
  • 22. And How is Big Data at Solving Problems? Really??
  • 23. Science Fair Projects ● During an economic downturn, AVIS decides to focus on customer service. ● Lady Gaga asks Facebook & Twitter fans to join mailing list ● 1-800-Flowers remembers important dates such as birthdays ● Cycling team picks up five seconds!
  • 24. Hawthorne Effect
  • 25. %27s_paradox#Westinghouse_effect ● Efficiency engineers in the 1920s and 1930s were trying to determine if improved working conditions such as better lighting improved the performance of production workers. The engineers noted that when they provided better working conditions in the production line, efficiency increased. But when the engineers returned the production line to its original conditions and observed the workers, their efficiency increased again. The engineers determined that it was merely the observation of the factory workers, not the changes in the conditions in the production line, that increased the measured efficiency.
  • 26. Is the world full of new data?
  • 27. Example of Data Creep ● Gender – Female – Male
  • 28. Example of Data Creep ● Gender – Female – Male – Null (no data)
  • 29. Example of Data Creep ● Gender – Female – Male – Null (no data) – State of California has 17 official statuses – Facebook has 50+
  • 30. That may be why DBAs Go Bald!
  • 31. Let's See How Past Predictions Turned Out as a Guide
  • 32. history-of-big-data/ ● 1944 Fremont Rider, Wesleyan University Librarian, publishes The Scholar and the Future of the Research Library. He estimates that American university libraries were doubling in size every sixteen years. Given this growth rate, Rider speculates that the Yale Library in 2040 will have “approximately 200,000,000 volumes, which will occupy over 6,000 miles of shelves… [requiring] a cataloging staff of over six thousand persons.”
  • 33. Yale Library Today ● 15,000,000 volumes as of 2014 – 185,000,000 volumes to go!!!
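How far off the forecast is can be checked with a short sketch. It uses only numbers from the two slides (doubling every sixteen years, 200,000,000 volumes in 2040, 15,000,000 volumes in 2014) and works backward from the 2040 target, so no starting collection size has to be assumed. The function name is made up for illustration.

```python
# Numbers from the slides on Rider's forecast and the Yale Library today.
DOUBLING_YEARS = 16
TARGET_YEAR, TARGET_VOLUMES = 2040, 200_000_000
ACTUAL_YEAR, ACTUAL_VOLUMES = 2014, 15_000_000


def volumes_on_curve(year):
    """Volumes implied for `year` by a curve that doubles every
    DOUBLING_YEARS and reaches TARGET_VOLUMES in TARGET_YEAR."""
    return TARGET_VOLUMES / 2 ** ((TARGET_YEAR - year) / DOUBLING_YEARS)


expected_2014 = volumes_on_curve(ACTUAL_YEAR)
shortfall = expected_2014 / ACTUAL_VOLUMES

print(f"on-curve 2014 collection: {expected_2014:,.0f} volumes")
print(f"forecast overshoots reality by a factor of {shortfall:.1f}")
```

To stay on Rider's curve, the library would already need roughly 65 million volumes in 2014, about four times what it actually holds.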
  • 35. Is BIG DATA new-ish?? ● 1961 Derek Price publishes Science Since Babylon, in which he charts the growth of scientific knowledge by looking at the growth in the number of scientific journals and papers. He concludes that the number of new journals has grown exponentially rather than linearly, doubling every fifteen years and increasing by a factor of ten during every half-century. Price calls this the “law of exponential increase,” explaining that “each [scientific] advance generates a new series of advances at a reasonably constant birth rate, so that the number of births is strictly proportional to the size of the population of discoveries at any given time.”
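Price's two figures describe the same curve, which a quick sketch confirms: something that doubles every fifteen years grows about tenfold in fifty. The only inputs are the numbers quoted on the slide; the function name is made up for illustration.

```python
# Doubling period quoted on the slide.
DOUBLING_YEARS = 15


def growth_factor(years, doubling_years=DOUBLING_YEARS):
    """Multiplicative growth after `years` for a quantity that doubles
    every `doubling_years`."""
    return 2 ** (years / doubling_years)


half_century = growth_factor(50)
century = growth_factor(100)

print(f"growth over 50 years: x{half_century:.2f}")
print(f"growth over 100 years: x{century:.1f}")
```

The same function shows why such curves make alarming forecasts: over a full century the factor is about one hundred.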
  • 36. Birth Rate … dropping
  • 37. 1971 Arthur Miller ● The Assault on Privacy -- “Too many information handlers seem to measure a man by the number of bits of storage capacity his dossier will occupy.”
  • 39. So Big Data ain't that new, eh? ● 1997 Michael Lesk publishes “How much information is there in the world?” Lesk concludes that “There may be a few thousand petabytes of information all told; and the production of tape and disk will reach that level by the year 2000. So in only a few years, (a) we will be able [to] save everything–no information will have to be thrown out, and (b) the typical piece of information will never be looked at by a human being.”
  • 40. Yeah, Big Data! ● May 2012 Danah Boyd and Kate Crawford publish “Critical Questions for Big Data” in Information, Communication & Society. They define big data as “a cultural, technological, and scholarly phenomenon that rests on the interplay of: (1) Technology: maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets. (2) Analysis: drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims. (3) Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.”
  • 41. Maybe it is not BIG DATA
  • 42. Some Examples, please, Dave!
  • 43.
  • 44. kanazawa
  • 45.
  • 46. Maybe for IE Tech Support Engineers
  • 47. So what do YOU do? ● Quantify why you really want a BIG DATA project – Like to buy servers in quantities of 10,000 – You will be at retirement age by the time the project really gets reviewed – You own stock in disk drive companies – Your stochastic analysis shows your boss will not understand anyway, so why not! – Probabilistic study of patterns ROI > Co$t
  • 48. Questions, Hopefully Answers ● This slide deck will be on ● @stoker