Tour of Big Data


Presentation at Southern California Code Camp, July 2013, in San Diego. This talk presents basic concepts in the world of big data and data science, with a focus on relational databases, noSQL, MapReduce, machine learning, and data visualization, along with demos of MapReduce in action and Pig on Hadoop. The purpose of this presentation is to familiarize you with the terminology and concepts of data science and whet your appetite for further exploration into the world of big data. This presentation is adapted from a Coursera online course with a similar title and scope.

Published in: Technology, Education
  • Whenever you see “yutechnet”, it is me. Next, ask the audience: Developer? DBA? DBE? Worked on any databases beyond relational databases? Use Hadoop or other noSQL on a daily basis?
  • Dumbed-down version of the course. Hard to pick topics to share. Major areas of data science, with a focus on big data and noSQL. Goal: become familiar with the big picture and terminology of data science, speak intelligently about the field, and use it as a springboard into specific areas you want to explore further.
  • Franklin’s key idea: “Big” is relative; it depends on what you are trying to do.
  • Analytics: statistical models, machine learning, slice-and-dice
  • Call out a few great features of relational databases to set the context of how we got here, so we don’t get lost in the world of big data and noSQL, where relational databases get a bad name as the old guard. Declarative: specify what you want; no need to worry about logical or physical operations and optimization.
  • Map, shuffle, and reduce.
  • Touch base on the HDFS layer: fault tolerance, job tracker, task tracker, etc.
  • Comments in lieu of demo: schema-on-read with LOAD; relational JOIN operation; optimization via relational algebra; lazy evaluation, where no work is done until STORE.
  • Pig performance: initially not as good as MR, but it caught up quickly; now almost the same as MR. Hive not covered, but 2011 data showed that >90% of MR jobs are executed via Hive. A clear win for a declarative language. Don’t feel bad if you know SQL.
  • About EC. Databases: “Everyone MUST see the same thing, either old or new, no matter how long it takes.” NoSQL: “For large applications, we can’t afford to wait that long, and maybe it doesn’t matter anyway.”
  • Memcache: load everything into memory and scale across hundreds of machines; consistent hashing. BigTable (Google, 2006): complementary to MapReduce; added an index (zoom-in) for fast key-based lookup.
  • Statistics emphasizes the accuracy of the model, while ML cares less about the nature of the model. Think of the example of building a super-accurate gun.

    1. Tour of Big Data – Raymond Yu, SoCal Code Camp 2013
    2. About myself • Sr. Database Architect @ BridgePoint Education • Blog • LinkedIn • @yutechnet
    3. About this talk… • Inspired by “Introduction to Data Science” on Coursera (Bill Howe, UW) • Guided tour of topics in data science – MapReduce, Pig – noSQL – Machine Learning – Information Visualization • Goal
    4. Big Data • Volume – Size of data • Velocity – The latency of data processing relative to the growing demand of interactivity • Variety – The diversity of sources, formats, quality, and structures “Big Data is any data that is expensive to manage and hard to extract value from.” – Michael Franklin
    5. Where does big data come from? • “Data exhaust” from customers • New sensor technologies • Individually contributed data at massive scale • Cheap to keep data
    6. Data Science • Data Preparation (at scale) • Analytics • Communication “The ability to take data, understand it, process it, extract value from it, visualize it, and communicate it” – Hal Varian, Google’s Chief Economist
    7. Context… src. Introduction to Data Science course
    8. Relational Databases • SQL as a declarative language • Indexes – Extract a small result from a big dataset – Built easily and used automatically when appropriate • Data consistency • “Old-style” scalability
    9. MapReduce • Google paper 2004 • Hadoop 2008 • High-level programming model for large-scale parallel data processing • Divide-and-conquer • Mapper + Reducer
    10. “Hello World” of MapReduce: count word frequency in millions of documents
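The word-count example above, and the map/shuffle/reduce phases from the speaker notes, can be sketched in plain Python. This is a single-process toy, not Hadoop; the function names are illustrative.

```python
from collections import defaultdict

def mapper(doc):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in doc.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(word, counts):
    # Reduce: sum the counts for each word.
    return (word, sum(counts))

docs = ["the quick brown fox", "the lazy dog", "the fox"]  # toy corpus
mapped = (pair for doc in docs for pair in mapper(doc))
result = dict(reducer(word, counts) for word, counts in shuffle(mapped))
print(result["the"])  # 3
```

In real Hadoop the mapper and reducer run in parallel on many machines and the shuffle moves data over the network; the divide-and-conquer shape is the same.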
    11. MapReduce Programming Model src. Course slide
    12. Show me the MapReduce…
    13. MapReduce in Hadoop
    14. Pig • An engine to execute programs on top of Hadoop • Language layer: Pig Latin • An Apache open source project • Yahoo! 2009
    15. Why use Pig?
    16. In MapReduce…
    17. In Pig Latin
    18. Pig System Overview
    19. Context… src. Introduction to Data Science course
    20. noSQL definitions • A term designating databases that differ from classic relational databases – Transactional model – Data model • Not much to do with SQL • “Not only SQL”
    21. Concepts • CAP Theorem – Consistency – Availability – Partition Tolerance • Eventual consistency
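The eventual-consistency idea above (and the "everyone must see the same thing... eventually" contrast from the speaker notes) can be shown with a toy replicated store. This is a made-up model for illustration; real systems use anti-entropy protocols, vector clocks, etc.

```python
class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    # Toy model (all names are illustrative): a write is accepted by one
    # replica immediately and propagates to the others only when sync() runs.
    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []

    def write(self, key, value):
        self.replicas[0].data[key] = value   # accepted right away
        self.pending.append((key, value))    # replication is deferred

    def read(self, replica_id, key):
        return self.replicas[replica_id].data.get(key)

    def sync(self):
        # Propagation pass: apply every deferred write on every replica.
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore(3)
store.write("x", 1)
print(store.read(2, "x"))  # None: replica 2 has not seen the write yet
store.sync()
print(store.read(2, "x"))  # 1: the replicas have converged
```

Between the write and the sync, different replicas return different answers; that window is exactly what a strictly consistent database refuses to allow.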
    22. noSQL One-page Overview
    23. Let’s walk through a few • Column definitions • RDBMS • Memcache • Dynamo • CouchDB • BigTable (HBase)
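The consistent hashing that the speaker notes mention for Memcache can be sketched as a hash ring. This is a minimal illustration (class and node names are invented): each key belongs to the next node clockwise on the ring, so removing a node only remaps the keys that lived on it.

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    # Minimal sketch: nodes are hashed onto a ring; a key is owned by the
    # first node at or after the key's hash. Virtual nodes (vnodes) smooth
    # out the distribution of keys across nodes.
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the next node position, wrapping at the top.
        i = bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{i}" for i in range(20)]
before = {k: ring.node_for(k) for k in keys}

# Drop one node: only keys that were on cache-c move anywhere.
smaller = ConsistentHashRing(["cache-a", "cache-b"])
```

With naive `hash(key) % num_servers`, removing a server remaps almost every key; on the ring, keys owned by the surviving nodes stay put, which is why Memcache-style clusters scale this way.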
    24. noSQL Common Features • The ability to replicate and partition data over many servers (scale) • Horizontal scaling of simple-operation throughput over many servers • A simple API – no query language (no SQL) • Weaker concurrency model than ACID transactions (no transactions) • The ability to dynamically add new attributes to data records (no schema)
    25. Machine Learning • Systems that automatically learn programs from data • Prediction – Given examples of inputs and outputs – Learn the relationship between them – Apply the relationship to a larger set • Different from statistical models – A large data set with a simple model trumps a small data set with a sophisticated model
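The prediction recipe on the slide above (given example inputs and outputs, learn the relationship, apply it to new inputs) can be shown with the simplest possible model, a least-squares line fit. The data here is made up for illustration.

```python
def fit_line(xs, ys):
    # Learn y ~ a*x + b from example input/output pairs
    # (ordinary least squares, closed form for a single feature).
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

# Training examples: inputs and observed outputs (invented data, roughly y = 2x).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]
a, b = fit_line(xs, ys)

# Apply the learned relationship to an input the model never saw.
def predict(x):
    return a * x + b

print(predict(6))  # close to 12, i.e. about 2 * 6
```

The ML framing cares that `predict` works on unseen inputs, not whether `a` and `b` describe the true underlying process, which is the statistics-vs-ML distinction from the speaker notes.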
    26. Bertin’s Visual Attributes
    27. Data Encoding Exercise
    28. Information Visualization
    29. Closing example: Nate Silver, Obama’s Data-Driven Campaign • Massive voter db • Hadoop as ETL • Vertica db for slice-and-dice
    30. Questions?