Thinking Big with Big Data


A non-technical introduction to Big Data that conveys its core concepts and ideas without giving in to the hype.

Published in: Data & Analytics
  • Nice presentation. I was looking for an OO-CDC DWH concept, not the classic tool semantic DV or transactional structure. As objects have content to be grouped/clustered very differently, the analytics being delivered could be more interesting.
  • Excellent presentation! Helped me get my mind around an intimidating subject.

  • Quote by
  • See
  • Inspired by Eric Raymond’s Cathedral and the Bazaar -
  • BASE (Basically Available, Soft state, Eventual consistency)
    See the CAP theorem for more details
  • Big data might not save the world, but it could entertain us
  • “Big Data and You” sounds like a good children’s book title.
  • This is the admin screen for Amazon Web Services. Not all of these services are Big Data services, but it gives you a good idea of an integrated Big Data platform.

    Although use of the term data science has exploded in business environments, many academics and journalists see no distinction between data science and statistics. Writing in Forbes, Gil Press argues that data science is a buzzword without a clear definition and has simply replaced “business analytics” in contexts such as graduate degree programs.[13] In the question-and-answer section of his keynote address at the Joint Statistical Meetings of American Statistical Association, noted applied statistician Nate Silver said, “I think data-scientist is a sexed up term for a statistician....Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician.”[14]
  • From Drew Conway
  • See Nassim Taleb’s excellent essay The Fourth Quadrant -
  • See for datasets
  • Thinking Big with Big Data

    1. 1. Thinking Big An Introduction to Big Data
    2. 2. About Me Shawn Hermans ● Data Engineer/Scientist ● Technology consultant ● Physics, math, data geek
    3. 3. About this Talk ● Non-technical introduction to Big Data ● Not focused on any technology or platform ● Focus on concepts
    4. 4. Should you believe the hype?
    5. 5. Big Data Promises ● No need for scientific method ● Predict disease outbreaks before the CDC ● Cure cancer ● Innovating healthcare ● Solve world hunger ● Bring about world peace
    6. 6. Big Data Criticism ● Garbage in, Garbage out ● Ignores the role of the scientific method ● Lots of questions don’t require large amounts of data to get good stats ● Privacy issues
    7. 7. Big Data is just another way to think about data
    8. 8. Mental Models “A mental model is simply a representation of an external reality inside your head. Mental models are concerned with understanding knowledge about the world.” - Farnam Street Blog
    9. 9. Examples ● Occam's razor ● Mind maps ● Law of supply and demand ● Never get in a land war in Asia
    10. 10. All models are wrong, but some are useful
    11. 11. Relational Resistance Resistance to big data concepts, technologies, and techniques because of the belief that the relational model is the only way to think about data. See also: Theory-induced blindness
    12. 12. Data Mental Models ● Relational ● Linked ● Object Oriented ● Geospatial ● Temporal ● Semantic ● Event Based ● Data as Code ● Bayesian ● Unstructured
    13. 13. What is Big Data?
    14. 14. “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” According to Gartner
    15. 15. According to Me Big data is the Bazaar to traditional data’s Cathedral
    16. 16. Cathedral and Bazaar
        Traditional Data: ● Clean ● Top down ● Carefully collected ● Scales vertically ● One true way
        Big Data: ● Disorderly ● Bottom up ● Randomly collected ● Scales horizontally ● More than one way
    17. 17. Big Data Differences
        Relational: ● Normalization ● ACID ● SQL/Query ● Structured/Schema
        Big Data: ● Denormalization ● BASE ● MapReduce/Other ● Loosely Structured
    18. 18. Integrating all available data is the promise of Big Data
    19. 19. Why should you care?
    20. 20. Information as an Asset ● Target specific customers' needs rather than broad segments ● Just-in-time inventory management ● Evaluate demand for a product ● Predict and track traffic patterns
    21. 21. Big Data and You ● What information do you have, that no one else has? ● Can you easily integrate your data or is it locked in silos? ● What data don’t you collect? ● What data don’t you archive?
    22. 22. Big Data Technology
    23. 23. Big Data Platforms
        Cloud: ● AWS ● Google ● Microsoft
        Hadoop: ● Cloudera ● MapR ● Hortonworks
        This isn’t an all-inclusive list, but a sample of the big players in the space.
    24. 24. Big Data Stack ● Batch Processing ● Data Collection ● SQL/Query ● Search ● Machine Learning ● Serialization ● Security ● Stream Processing ● File Storage ● Resource management ● Online NoSQL ● Data Pipeline
    25. 25. What about data science?
    26. 26. What IS Data Science? ● Data science is statistics on a Mac ● A data scientist is a statistician who lives in San Francisco ● A person who is better at statistics than any software engineer and better at software engineering than any statistician
    27. 27. The need for Data Science ● There is a LOT of data ● Too much data for people to look at it all ● Probabilistic models help extract signal from the noise ● Need to automate the analysis and exploitation of data
    28. 28. Big Data has its limits
    29. 29. Black Swans and Big Data ● There are fundamental limits to prediction ● Hard to predict rare events where no prior data exists (i.e. Black Swans) ● Complex systems often have feedback loops (e.g. stock market)
    30. 30. What’s next?
    31. 31. Getting Started
        Business: ● Identify some unresolved questions ● Figure out what data could answer those questions ● Pick the easiest and test out your hypothesis
        Technology: ● Pick a technology you know or want to learn ● Pick a platform ● Pick a data set and identify some basic problems to solve
    32. 32. My Info Twitter: @shawnhermans Github: Blog: (In Progress) Slideshare: Quora:
    33. 33. Backup Slides
    34. 34. The Fourth Quadrant and the Failure of Statistics
    35. 35. Soothsayer ● Simple HTTP/JSON API for training/classifying data ● Lots of built-in classifier statistics