{Python} in Big Data World
Provides an insight into Big Data, Hadoop, MapReduce, and Python.

  1. {Python} in Big Data World
  2. Objectives • What is Big Data • What is Hadoop and its ecosystem • Writing Hadoop jobs using MapReduce programming
  3. Structured data {relational data with well-defined schemas} vs. multi-structured data {social data, blogs, click streams, machine-generated data, XML, etc.}
  4. Trends (Gartner): mobile analytics, mobility, app stores and marketplaces, human-computer interfaces, Big Data, personal cloud, multi-touch UI, in-memory computing, advanced analytics, green data centres, flash memory, social CRM, solid-state drives, HTML5, context-aware computing
  5. The Problem…
  6. [Chart] Source: The Economist
  7. The Problem… Facebook: 955 million active users as of March 2012; 1 in 3 Internet users has a Facebook account. More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) are shared each month. Facebook holds 30 PB of data for analysis and adds 12 TB of compressed data daily.
  8. The Problem… Twitter: 500 million users, 340 million tweets daily, 1.6 billion search queries a day, and 7 TB of data generated daily for analysis. Traditional data storage, techniques, and analysis tools just do not work at these scales!
  9. Big Data dimensions (the four Vs): Volume, Variety, Velocity, Value
  10. Hadoop
  11. What is Hadoop… A flexible, available architecture for large-scale distributed batch processing on a network of commodity hardware.
  12. An Apache top-level project (http://hadoop.apache.org/) with 500 contributors. It has one of the strongest ecosystems, with a large number of sub-projects. Yahoo! has one of the biggest installations, running Hadoop on thousands of servers.
  13. Inspired by… {Google GFS + MapReduce + Bigtable}, the architecture behind Google's search engine, which inspired the creator of the Hadoop project.
  14. Use cases… What is Hadoop used for: big/social data analysis, text mining and pattern search, machine log analysis, geo-spatial analysis, trend analysis, genome analysis, drug discovery, fraud and compliance management, video and image analysis
  15. Who uses Hadoop… a long list: • Amazon/A9 • Facebook • Google • IBM • Disney • Last.fm • New York Times • Yahoo! • Twitter • LinkedIn
  16. What is Hadoop used for? • Search: Yahoo, Amazon, Zvents • Log processing: Facebook, Yahoo, ContextWeb, Last.fm • Recommendation systems: Facebook, Disney • Data warehouse: Facebook, AOL, Disney • Video and image analysis: New York Times • Computing carbon footprint: Opower
  17. Our own… AADHAAR uses Hadoop and HBase for its data processing…
  18. Hadoop ecosystem… ZooKeeper, Flume, Oozie, Whirr, Chukwa, Avro, Sqoop, Hive, Pig, and HBase, layered over MapReduce/HDFS
  19. Hive: data warehouse infrastructure built on top of Hadoop for data summarization and aggregation, queried in a SQL-like language called HiveQL. HBase: a NoSQL columnar database and an implementation of Google Bigtable; it can scale to store billions of rows. Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Avro: a data serialization system. Sqoop: used for transferring bulk data between Hadoop and traditional structured data stores.
  20. Hadoop distributions…
  21. Scale up: App → traditional databases (grow by adding bigger machines)
  22. Scale out: App → Hadoop Distributed File System (grow by adding more machines)
  23. Why Hadoop? How is Hadoop different from other parallel processing architectures such as MPI, OpenMP, and Globus?
  24. Hadoop moves compute to the data, while in other parallel processing architectures the data gets distributed to the compute.
  25. Hadoop components… HDFS, MapReduce, JobTracker, TaskTracker, NameNode
  26. Python + Analytics: a high-level language, highly interactive, highly extensible, functional, with many extensible libraries such as • SciPy • NumPy • Matplotlib • pandas • IPython • StatsModels • NLTK, to name a few.
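  As a small illustration of that interactive, library-driven style (an added sketch, not from the original deck; it assumes NumPy and pandas are installed and uses made-up data):

      import numpy as np
      import pandas as pd

      # a tiny, made-up click-stream sample
      clicks = pd.DataFrame({
          'user':  ['a', 'b', 'a', 'c', 'b', 'a'],
          'pages': [3, 1, 4, 2, 5, 1],
      })

      # total and mean pages per user, computed in two lines
      print(clicks.groupby('user')['pages'].agg(['sum', 'mean']))
      print('overall mean:', np.mean(clicks['pages']))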
  27. What is common between Mumbai dabbawalas and Apache Hadoop? (Source: Cloudstory.in; author: Janakiram MSV)
  28. What is MapReduce? MapReduce is a programming model for processing large data sets on a distributed computing cluster.
  29. MapReduce steps: Map → Shuffle → Reduce
  30. [Diagram] Map → Shuffle → Reduce
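  To make the three phases concrete, here is a minimal in-memory sketch of map, shuffle, and reduce for a word count in plain Python (an added illustration, not a Hadoop API):

      from collections import defaultdict

      lines = ['big data', 'big hadoop', 'data data']

      # Map: emit a (word, 1) pair for every word in every line
      mapped = [(word, 1) for line in lines for word in line.split()]

      # Shuffle: group the emitted values by key, as Hadoop does
      # between the map and reduce phases
      groups = defaultdict(list)
      for word, count in mapped:
          groups[word].append(count)

      # Reduce: aggregate the values collected for each key
      reduced = {word: sum(counts) for word, counts in groups.items()}
      print(reduced)  # {'big': 2, 'data': 3, 'hadoop': 1}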
  31. MapReduce jobs can be written with: • Java • Hive • Pig scripts • Datameer • Cascading (Cascalog, Scalding) • streaming frameworks (Wukong, Dumbo, mrjob, Happy) — see the sketch below for the streaming-framework style.
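  As a hedged sketch of that streaming-framework style (assuming the mrjob package is installed; this example is not part of the original deck), the word count from the later slides can be expressed as:

      from mrjob.job import MRJob

      class MRWordCount(MRJob):
          # Map: emit (word, 1) for every word in a line
          def mapper(self, _, line):
              for word in line.split():
                  yield word, 1

          # Reduce: sum the counts emitted for each word
          def reducer(self, word, counts):
              yield word, sum(counts)

      if __name__ == '__main__':
          MRWordCount.run()

  Saved as wordcount.py, this runs locally with `python wordcount.py input.txt` and, with the appropriate runner flag, on a Hadoop cluster.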
  32. Pig script (word count):

      input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
      -- Extract words from each line and put them into a pig bag
      -- datatype, then flatten the bag to get one word on each row
      words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
      -- filter out any words that are just white spaces
      filtered_words = FILTER words BY word MATCHES '\w+';
      -- create a group for each word
      word_groups = GROUP filtered_words BY word;
      -- count the entries in each group
      word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
      -- order the records by count
      ordered_word_count = ORDER word_count BY count DESC;
      STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
  33. Hive (word count):

      CREATE TABLE input (line STRING);
      LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;
      -- temporary table to hold words
      CREATE TABLE words (word STRING);
      SELECT word, COUNT(*) FROM input
      LATERAL VIEW explode(split(line, ' ')) lTable AS word
      GROUP BY word;
  34. Hadoop Streaming… http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
  35. Map: mapper.py

      #!/usr/bin/env python
      import sys

      # input comes from STDIN (standard input)
      for line in sys.stdin:
          # remove leading and trailing whitespace
          line = line.strip()
          # split the line into words
          words = line.split()
          # increase counters
          for word in words:
              # write the results to STDOUT (standard output);
              # what we output here will be the input for the
              # Reduce step, i.e. the input for reducer.py
              #
              # tab-delimited; the trivial word count is 1
              print '%s\t%s' % (word, 1)
  36. Reduce: reducer.py

      #!/usr/bin/env python
      from operator import itemgetter
      import sys

      current_word = None
      current_count = 0
      word = None

      # input comes from STDIN
      for line in sys.stdin:
          # remove leading and trailing whitespace
          line = line.strip()
          # parse the input we got from mapper.py
          word, count = line.split('\t', 1)

  37. Reduce: reducer.py (cont.)

          # convert count (currently a string) to int
          try:
              count = int(count)
          except ValueError:
              # count was not a number, so silently
              # ignore/discard this line
              continue

          # this IF-switch only works because Hadoop sorts map output
          # by key (here: word) before it is passed to the reducer
          if current_word == word:
              current_count += count
          else:
              if current_word:
                  # write result to STDOUT
                  print '%s\t%s' % (current_word, current_count)
              current_count = count
              current_word = word

  38. Reduce: reducer.py (cont.)

      # do not forget to output the last word if needed!
      if current_word == word:
          print '%s\t%s' % (current_word, current_count)
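  The pair can be sanity-checked locally before submitting to a cluster, e.g. with `cat input.txt | python mapper.py | sort | python reducer.py` (the linked tutorial recommends the same check). The short driver below, a hypothetical helper that is not part of the tutorial, emulates Hadoop Streaming's map → sort → reduce flow; it assumes mapper.py, reducer.py, and an input.txt sit in the current directory and that `python` resolves to an interpreter the scripts run under:

      import subprocess

      # run the mapper over the input file
      with open('input.txt') as infile:
          mapper = subprocess.Popen(['python', 'mapper.py'],
                                    stdin=infile, stdout=subprocess.PIPE)
          map_output = mapper.communicate()[0]

      # Hadoop sorts the map output by key before the reduce phase
      shuffled = b''.join(sorted(map_output.splitlines(True)))

      # feed the sorted stream to the reducer
      reducer = subprocess.Popen(['python', 'reducer.py'],
                                 stdin=subprocess.PIPE, stdout=subprocess.PIPE)
      print(reducer.communicate(shuffled)[0].decode())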
  39. Thank You
