BW Tech Meetup: Hadoop and The rise of Big Data

Hadoop is an open source, distributed computation platform that is very important in the worlds of search, analytics, and big data. Donald Miner, a Solutions Architect at Greenplum, will give an hour-long presentation focused on ways to get started with Hadoop and advice on how to use the platform successfully.

Specific topics of discussion include how Hadoop works, what Hadoop should and should not be used for, MapReduce design patterns, and the upcoming synergy of SQL and NoSQL in Hadoop.

Published in: Technology

  1. Hadoop and the Rise of Big Data
     February 21, 2013
     Donald Miner, @donaldpminer, Donald.Miner@emc.com
  2. About Don
  3. Hadoop
     • Distributed platform up to thousands of nodes
     • Data storage and application framework
     • Started at Yahoo!
     • Open source
     • Based on a few Google papers (2003, 2004)
     • Runs on commodity hardware
     I'M HERE TO TELL YOU WHY HADOOP IS AWESOME
  4. Hadoop users
     • Yahoo!
     • Facebook
     • eBay
     • AOL
     • Riot Games
     • ComScore
     • Twitter
     • LinkedIn
     Hadoop companies
     • Cloudera, Hortonworks, EMC/Greenplum, IBM
     • Numerous startups
  5. Buzzword glossary
     • Unstructured & Structured Data
     • NoSQL
     • Big Data (volume, velocity, variety)
     • Data Science
     • Cloud computing
  6. Hadoop component overview
     • Core components:
       – HDFS (Hadoop Distributed File System)
       – MapReduce (data analysis framework)
     • Ecosystem:
       – HBase (key-value store)
       – Pig (high-level data analysis language)
       – Hive (SQL-like data analysis language)
       – ZooKeeper (stores metadata)
       – Other stuff
  7. Use cases
     • Text processing
       – Indexing, counting, processing
     • Large-scale reports
     • Data science
     • Mixing data sources (data lakes)
     • Ad targeting
     • Image/Video/Audio processing
     • Cybersecurity
  8. HDFS
     • Stores files in folders (that's it)
       – Nobody cares what's in your files
     • Chunks large files into blocks (~64MB-1GB)
     • Blocks are scattered all over the place
     • 3 replicas of each block (better safe than sorry)
     • One NameNode (might be sorry)
       – Knows which computers blocks live on
       – Knows which blocks belong to which files
     • One DataNode per computer (slaves!)
       – Hosts the file blocks
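The chunking and replication bullets above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the node names, and the round-robin placement are all assumptions made for illustration (real HDFS picks replica locations with a rack-aware policy). The 64 MB block size and the replication factor of 3 come from the slide.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the low end of the range on the slide
REPLICATION = 3                # three copies of every block


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of file_size bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy NameNode table: block index -> DataNodes holding a copy.

    Round-robin placement is an illustrative assumption, not the real policy.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    ring = itertools.cycle(datanodes)
    return {b: [next(ring) for _ in range(replication)]
            for b in range(num_blocks)}


blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(len(blocks))    # 4 blocks: 64 + 64 + 64 + 8 MB
print(placement[0])   # ['node1', 'node2', 'node3']
```

The `placement` dictionary plays the role of the NameNode's bookkeeping: it is the only place that knows which nodes hold which block, which is why a single NameNode "might be sorry".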
  9. HDFS Demonstration
  10. MapReduce
     • Analyzes data in HDFS where the data is
     • Jobs are split into Mappers and Reducers
     • JobTracker: keeps track of running jobs
     • TaskTracker: one per computer, executes tasks
     • Mappers (you code this)
       – Loads data from HDFS
       – Filter, transform, parse
       – Outputs (key, value) pairs
     • Reducers (you code this, too)
       – Groups by the mapper's output key
       – Aggregate, count, statistics
       – Outputs to HDFS
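The mapper and reducer roles above can be illustrated with a single-process word count in plain Python. This is a sketch of the programming model only, assuming everything fits in memory; it does not use Hadoop's Java API, and `mapper`, `reducer`, and `run_job` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def mapper(line):
    """Parse one input line and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Aggregate every value that was emitted under the same key."""
    return key, sum(values)


def run_job(lines):
    """Run map, group-by-key, and reduce in one process."""
    groups = defaultdict(list)
    for line in lines:                      # map phase
        for key, value in mapper(line):
            groups[key].append(value)       # group by the mapper's output key
    # reduce phase, one call per distinct key
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


counts = run_job(["Hadoop stores data", "Hadoop analyzes data"])
print(counts)  # {'analyzes': 1, 'data': 2, 'hadoop': 2, 'stores': 1}
```

In a real job the framework does the grouping and runs many mappers and reducers on different machines; the two functions you write keep roughly this shape.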
  11. MapReduce Demonstration
  12. Hadoop ecosystem
     • HDFS and MapReduce don't do everything
     • Pig: high-level language
         grpd = GROUP logs BY userAgent;
         counts = FOREACH grpd GENERATE group,
                      AVG(logs.timeMicroSec)/1.0E+06 AS loadTimeSec;
         byCount = ORDER counts BY loadTimeSec DESC;
         top = LIMIT byCount 15;
     • Hive: high-level SQL language
         SELECT grp, SUM(col2), COUNT(*)
         FROM table1
         GROUP BY grp;
     • HBase: key/value store
  13. Cool thing #1: Linear Scalability
     • HDFS and MapReduce scale linearly
     • If you have twice as many computers, things run twice as fast
     • If you have twice as much data, things run twice as slow
     • If you have twice as many computers, you can store twice as much data
     • This stays true (some minor caveats)
     • DATA LOCALITY!!
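The scaling claims above amount to a simple proportionality, which the following sketch makes concrete. The model (run time = data size divided by total cluster throughput) and the per-node rate of 0.5 TB per hour are assumptions picked for the arithmetic, not measurements, and the model ignores the "minor caveats" the slide mentions.

```python
def job_hours(data_tb, nodes, tb_per_node_hour=0.5):
    """Idealized run time: data divided by total cluster throughput.

    tb_per_node_hour is a made-up rate used only for illustration.
    """
    return data_tb / (nodes * tb_per_node_hour)


baseline = job_hours(data_tb=100, nodes=10)
double_nodes = job_hours(data_tb=100, nodes=20)
double_data = job_hours(data_tb=200, nodes=10)

print(baseline)      # 20.0 hours
print(double_nodes)  # 10.0 hours: twice the computers, twice as fast
print(double_data)   # 40.0 hours: twice the data, twice as slow
```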
  14. Cool thing #2: Schema on Read
     Before: ETL, schema design, tossing out original data
     Now: LOAD DATA -> ???? -> PROFIT!!
     Data is parsed/interpreted as it is loaded out of HDFS
     What implications does this have?
     • Keep original data around!
     • Have multiple views of the same data!
     • Store first, figure out what to do with it later!
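A small sketch of what "parsed as it is loaded" can look like in practice: the raw lines are stored untouched, and each analysis applies its own interpretation when it reads them. The log format, field order, and function names below are hypothetical and exist only for this example.

```python
# Raw data, stored exactly as it arrived (hypothetical log format:
# timestamp, user agent, load time in microseconds).
RAW_LOGS = [
    "2013-02-21T10:00:00 Mozilla/5.0 1500000",
    "2013-02-21T10:00:05 curl/7.29 250000",
]


def load_times_sec(lines):
    """View 1: interpret each line as (user agent, load time in seconds)."""
    for line in lines:
        _timestamp, user_agent, micros = line.split()
        yield user_agent, int(micros) / 1e6


def requests_per_day(lines):
    """View 2: read the same lines, but only look at the date."""
    counts = {}
    for line in lines:
        day = line.split("T", 1)[0]
        counts[day] = counts.get(day, 0) + 1
    return counts


print(list(load_times_sec(RAW_LOGS)))  # [('Mozilla/5.0', 1.5), ('curl/7.29', 0.25)]
print(requests_per_day(RAW_LOGS))      # {'2013-02-21': 2}
```

Because neither view changes `RAW_LOGS`, a third interpretation can be added later without reloading or re-ingesting anything.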
  15. Cool thing #3: Transparent Parallelism
     RPC? Code deployment? Network programming? Data center fires? Distributed stuff? Inter-process communication? Fault tolerance? Message passing? Threading? Locking?
     With MapReduce, I DON'T CARE … I just have to fit my solution into this tiny box
     [Figure labels: Solution, MapReduce]
  16. Cool thing #4: Cheap
     • Commodity hardware (meh)
     • Open source (people cost more though)
     • Add more hardware later
  17. How to get started
     • Install Hadoop in a Linux VM
       – Wait, how is this helpful?? Hadoop is distributed!
     • Use Google (seriously)
     • Some prerequisites: Java, Linux, Data, Time
  18. Stuff Hadoop is good at
     • Batch processing
     • Processing lots of data
     • Outputting lots of data
     • Storing lots of historical data
     • Flexible analysis of data
     • Dealing with unstructured or structured data
  19. Stuff Hadoop is not good at
     • Hadoop is a freight truck, not a sports car
     • Updating data (think "append-only")
     • Being easy to use
       – Java
       – Administration
     • Hadoop is not good storage (don't throw away your EMC stuff!)
  20. QUESTIONS?
     Donald Miner, @donaldpminer, Donald.Miner@emc.com