Amazon Web Services: EMR (Elastic MapReduce) with ITOC Australia - What Is EMR/Hadoop?
Elastic MapReduce on the AWS Cloud
http://linkedin.com/in/davidnedved
What is MapReduce?
MapReduce is a programming model that comes from functional programming (like LISP).
MapReduce is a "framework" for:
 ● Processing parallel problems
 ● Across HUGE datasets
 ● Using a LARGE number of computers!
Made "popular" by Google
 ● Used to compute the index that maps "terms" to "pages" (the inverted index behind Google Search; not to be confused with Google's PageRank algorithm, which ranks pages).
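To make the term-to-page index concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps in plain Python. It is an illustration only, not how Hadoop or Google implement it: the function names (`map_page`, `shuffle`, `reduce_term`, `build_index`) and the two sample pages are assumptions made up for this example. In a real cluster the map and reduce calls would run in parallel on many machines and the framework would do the shuffle.

```python
from collections import defaultdict


def map_page(page_id, text):
    """Map step: emit one (term, page_id) pair per distinct word on a page."""
    for term in set(text.lower().split()):
        yield term, page_id


def shuffle(pairs):
    """Shuffle step: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_term(term, page_ids):
    """Reduce step: collapse all pages seen for one term into a sorted list."""
    return term, sorted(page_ids)


def build_index(pages):
    """Run map, shuffle and reduce over a dict of {page_id: text}."""
    pairs = (
        pair
        for page_id, text in pages.items()
        for pair in map_page(page_id, text)
    )
    grouped = shuffle(pairs)
    return dict(reduce_term(term, ids) for term, ids in grouped.items())


# Hypothetical sample data, invented for this sketch.
pages = {
    "page1": "hadoop runs mapreduce jobs",
    "page2": "emr runs hadoop on aws",
}
index = build_index(pages)
```

Looking up `index["hadoop"]` gives the list of pages that contain the term. Because each page is mapped independently and each term is reduced independently, both steps can be spread across as many machines as you have.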
Ok - so...???
 ● Primary data on the internet has grown exponentially in the last 10 years...
 ● Secondary data has gone "off the scale"...
 ● UH?
   ○ We seem to log everything and "ask questions later"
For instance:
 ● Recommendations (books, restaurants, etc.)
 ● Predict trends (job skills in demand, Amazon's recent ...)
 ● Show customised ads on my site, etc.
 ● Record every query a user makes on my site http://w3dt.net/
 ● Big Data is no longer a problem just for the big boys (Google, Microsoft, etc.)
 ● Startups are "epically failing" to get on top of their big data...
Hadoop
 ● Hadoop can help with BigData
   ○ It's proven in the field
   ○ Under active development
   ○ Will only get cheaper as hardware/AWS prices drop!
 ● Cheaper storage and retrieval (through a limited SQL interface)
 ● Makes parallel programming easier
 ● Scalability for storage/retrieval
"Ok, so is Hadoop a database?"
NO, NO, NO! Hadoop is a processing platform. It combines data storage, retrieval and programming into a single highly scalable package.
EMR simply kicks ass
 ● Import/export your BigData to the AWS platform quickly
 ● Multipart Upload (S3)
 ● Resize running job flows
 ● Balance cost and performance
 ● Resize based on usage patterns
 ● Access control --> IAM, VPC, everything else in standard EC2...
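As a rough sketch of "resize running job flows", the snippet below builds a resize request and sends it with the boto3 EMR client's `modify_instance_groups` call. Everything specific here is an assumption for illustration: the helper names, the `ap-southeast-2` default region, and the placeholder IDs in the usage note. It also assumes boto3 is installed and AWS credentials are configured before `resize_instance_group` is called.

```python
def build_resize_request(cluster_id, instance_group_id, new_count):
    """Build the keyword arguments for an EMR instance-group resize."""
    if new_count < 0:
        raise ValueError("new_count must not be negative")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ],
    }


def resize_instance_group(cluster_id, instance_group_id, new_count,
                          region="ap-southeast-2"):
    """Ask EMR to grow or shrink one instance group of a running cluster.

    The region default is an assumption; use the region your cluster runs in.
    """
    # Imported lazily so the request builder works without boto3 or AWS access.
    import boto3

    emr = boto3.client("emr", region_name=region)
    request = build_resize_request(cluster_id, instance_group_id, new_count)
    return emr.modify_instance_groups(**request)
```

With placeholder IDs, a call would look like `resize_instance_group("j-EXAMPLE", "ig-EXAMPLE", 10)`. Running this on a schedule (more task nodes during the day, fewer overnight) is one way to balance cost and performance around usage patterns.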
For example...
EMR can be used to efficiently export DynamoDB tables to S3, import S3 data into DynamoDB, and perform sophisticated queries across tables stored in both DynamoDB and other storage services such as S3.
 ● By exporting rarely used data to S3 you will save $$$.
 ● Exported data in S3 is directly queryable (via EMR)
 ● Join exported tables with current DynamoDB tables!
Create a Hive table (notice the S3 location):
CREATE EXTERNAL TABLE sms_prices_s3 (
  country string,
  network int,
  networkname string,
  price string
)
-- code is the partition column; Hive rejects a column that is listed in both places
PARTITIONED BY (code string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://itoc-usergroup/sample';
For example...
Querying the external table (data in S3):
SELECT code, country, networkname, price
FROM sms_prices_s3
WHERE code = 'AU';
 ● Remember: you can run EMR (Hadoop) on just about ANY form of data!
 ● Use EMR to query your NoSQL DB with SQL-like queries (:
 ● Store your BigData in S3, DynamoDB, etc.; S3, for one, is designed for 99.999999999% durability
DynamoDB catch-out...
If you want to query DynamoDB using Hadoop you MUST use EMR... The Hive library for it isn't available for your own EC2 instances.
A few real-life examples
 ● Data Analytics: Google Analytics/Quantcast
 ● Crawling: Google Search
 ● Full-text Indexing: Just about every HUGE system
 ● Data Mining: LinkedIn Maps (: