Big Data and Data Intensive Computing: Education and Training

Published in: Technology, Education
Transcript of "Big Data and Data Intensive Computing: Education and Training"

  1. Big Data and Data Intensive Computing: Education and Training
     Graduate School of Communication & Art, Yonsei University, Shinchon, Korea, Sept 5th 2013
     Jongwook Woo (PhD), High-Performance Information Computing Center (HiPIC), Computer Information Systems Department, California State University, Los Angeles (CSULA)
     Educational Partner with Cloudera and Grants Awardee of Amazon AWS
  2. Contents
     – Introduction
     – Big Data Use Cases
     – Data Issues
     – Big Data
     – Data-Intensive Computing: Hadoop
     – Training in Big Data
     – Big Data Supporters
  3. Me
     – Name: Jongwook Woo (우종욱)
     – Occupation: Professor (rank: Associate Professor), California State University, Los Angeles, in the capital city of entertainment
     – Experience:
       • Professor since 2002: Computer Information Systems Dept, College of Business and Economics (www.calstatela.edu/faculty/jwoo5)
       • Since 1998, consulting for many companies in Hollywood and elsewhere: mainly building eBusiness applications with J2EE middleware; information extraction and integration using the FAST, Lucene/Solr, and Sphinx search engines; clients include Warner Bros (Matrix online game), E!, citysearch.com, and ARM
       • Interested in Hadoop and Big Data since around 2009
  4. Me: Experience (continued)
     – As of summer 2013, advising IglooSecurity:
       • Training on Hadoop and its ecosystems
       • R&D on a system for fast search over security log files generated at 30 GB to 100 GB per day, using Hadoop, Solr, Java, and Cloudera
     – Mid-September 2013: Samsung Advanced Institute of Technology
       • Three days of training on Hadoop and its ecosystems planned
       • Using Cloudera material in Korea, as far as I know
  5. Experience in Big Data
     – Grants:
       • Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
       • Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
     – Partnership:
       • Academic Education Partnership with Cloudera since June 2012
       • Linked with Hortonworks since May 2013; positive about providing a partnership
  6. Experience in Big Data
     – Certificates:
       • Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I” (July 8 2012)
       • Certificate of 10gen Training Course, “M101: MongoDB Development” (Dec 24 2012)
     – Blog and GitHub for Hadoop and its ecosystems:
       • http://dal-cloudcomputing.blogspot.com/ (Hadoop, AWS, Cloudera)
       • https://github.com/hipic (Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop)
       • https://github.com/dalgual
  7. Experience in Big Data
     – Several publications regarding Hadoop and NoSQL:
       • “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012)
       • “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011
       • “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011)
       • Jongwook Woo, “Introduction to Cloud Computing”, in the 10th KOCSEA Technical Symposium, UNLV, Dec 18 - 19, 2009
     – Talks at Korean universities and companies:
       • Yonsei, Sookmyung, KAIST, Korean Polytech Univ (Winter 2011)
       • VanillaBreeze (Winter 2011)
  8. What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
  9. Data
     – Google: “We don’t have a better algorithm than others but we have more data than others”
  10. Use Cases in Korea
      – SK Telecom
      – Seoul
      – Credit Cards
      – Hyundai Motors
  11. SK Telecom: T Map
      – Collects GPS traffic data from taxis, buses, and rental cars
        • Every 5 minutes, traffic data from 50,000 cars
      – Tells the quickest route to the destination
  12. Seoul: Night Bus
      – Collects GPS traffic data from taxis
      – Finds the most frequent traffic routes
        • Builds bus lines for the night
  13. Credit Cards
      – Apps to find popular restaurants
        • Collect customer behavior from card use at restaurants
        • Logic: frequency of visits to the same restaurant within 3 months
        • Show the popular restaurants
      – Credit cards for gas station discounts
        • Detect use of a card at a gas station that does not provide discounts
        • Sell a new card that gives a discount at any station
  14. Hyundai Motors
      – Improve the present and future models
        • Collect drivers’ behavior and the status of the cars
        • Collect any errors in the car
  15. Use Cases
      – Presidential Election
      – Amazon AWS
      – HuffPost | AOL
  16. Presidential Election
      – People behavior analysis
        • Collect people’s data: credit card usage, car models, newspapers read, Facebook, Twitter
      – For example, a pro-environmental campaign aimed at a mom
        • who sends her kids to public school
        • who tweets about organic foods
  17. HuffPost | AOL [10]: Two Machine Learning Use Cases
      – Comment moderation: evaluate all new HuffPost user comments every day
        • Identify abusive / aggressive comments
        • Auto delete / publish ~25% of comments every day
      – Article classification: tag articles for advertising
        • E.g.: scary, salacious, …
  18. HuffPost | AOL [10]: Parallelize on Hadoop
      – Good news:
        • Mahout, a parallel machine learning tool, is already available.
        • Mallet, libsvm, Weka, … support the necessary algorithms.
      – Bad news:
        • Mahout doesn’t support the necessary algorithms yet.
        • The other tools’ algorithms do not run natively on Hadoop.
      – Approach: build a flexible ML platform running on Hadoop, with Pig for the Hadoop implementation.
  19. Others
      – amazon.com: recommends books to people
      – Google
        • Detects influenza much earlier, by analyzing the areas affected by influenza
        • Translator, by analyzing data from many people
      – Apple’s Siri: natural language processing over data from many people
  20. New Data Trend
      – Sparsity
      – Unstructured: schema-free data with sparse attributes
        • Semantic or social relations
      – No relational properties, nor complex join queries
        • Log data
      – Immutable: no need to update or delete data
  21. Data Issues
      – Large-scale data: terabytes (10^12 bytes), petabytes (10^15 bytes)
        • Because of the web
        • Sensor data, bioinformatics, social computing, smartphones, online games, …
      – Cannot be handled with the legacy approach
        • Too big
        • Un-/semi-structured data
        • Too expensive
      – Need new systems
        • Inexpensive
  22. Two Cores in Big Data
      – How to store Big Data: NoSQL DB
      – How to compute Big Data: parallel computing with multiple inexpensive computers
        • Your own supercomputer
  23. Big Data for RDBMS: Issues in RDBMS
      – Hard to scale: relations get broken by
        • Partitioning for scalability
        • Replication for availability
      – Speed: the seek times of physical storage
        • Slower than network speed
        • One 1 TB disk at a 10 MB/s transfer rate: 100K sec => 27.8 hrs
        • With multiple data sources at different places, 100 disks of 10 GB each at 10 MB/s per disk: 1K sec => 16.7 min
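
The scan-time figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the slide's transfer rate means 10 megabytes per second (the only reading that reproduces its 100K sec and 1K sec figures) and that the disks are read fully in parallel; the function name `scan_seconds` is made up for illustration.

```python
# Assumption: the transfer rate is 10 megabytes per second per disk.
TRANSFER_RATE_BYTES_PER_SEC = 10 * 10**6


def scan_seconds(total_bytes, disks=1, rate=TRANSFER_RATE_BYTES_PER_SEC):
    """Seconds to read total_bytes split evenly across disks read in parallel."""
    return total_bytes / disks / rate


one_disk = scan_seconds(10**12)                  # 1 TB on a single disk
hundred_disks = scan_seconds(10**12, disks=100)  # 100 disks of 10 GB each

print(f"1 disk:    {one_disk:,.0f} s = {one_disk / 3600:.1f} h")
print(f"100 disks: {hundred_disks:,.0f} s = {hundred_disks / 60:.1f} min")
```

The hundredfold speedup comes purely from reading the partitions at the same time, which is the motivation for spreading data over many inexpensive machines.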
  24. Big Data for RDBMS (Cont’d): Issues in RDBMS (Cont’d)
      – Data integration
        • Not good for un-/semi-structured data
        • Much data is unstructured: web or log data, etc.
      – RDB is not good at parallelization
        • Cannot split 1000 tasks across 1000 inexpensive PCs efficiently
  25. RDBMS Issues: Solution
      – Before: Data Warehouse
      – Now and future: Big Data with the Hadoop framework
        • Data Computation (MapReduce, Pig)
        • Data Repositories (NoSQL DB: HBase, Cassandra, MongoDB)
        • Business Intelligence (Data Mining, OLAP, Data Visualization, Reporting): Hive, Mahout
  26. Big Data: Definition
      – Systems that support an inexpensive platform to store and compute large-scale, un-/semi-structured data
  27. Use Cases for NoSQL DB [1]
      – RDBMS replacement for high-traffic web applications
      – Semi-structured content management
      – Real-time analytics & high-speed logging
      – Web infrastructure: Web 2.0, Media, SaaS, Gaming, Finance, Telecom, Healthcare, Government
      – Three NoSQL DB approaches: Key/Value, Column, Document
  28. Data Store of NoSQL DB: Key/Value store
      – (Key, Value)
      – Functions: index, versioning, sorting, locking, transaction, replication
      – Apache Cassandra, Memcached
  29. Data Store of NoSQL DB (Cont’d): Column-Oriented Stores (Extensible Record Stores)
      – Store data tables as sections of columns of data, rather than as rows of data like most RDBMS
        • Sparse fields in RDBMS
        • Well suited for OLAP-like workloads (e.g., data warehouses)
      – Extensible records are horizontally and vertically partitioned across nodes
        • Rows and columns are distributed over multiple nodes
      – BigTable, HBase, Cassandra, Hypertable
  30. Data Store of NoSQL DB (Cont’d)
      Example table:
        StudentId  Lastname  Firstname  email
        1          Smith     Joe        smith@hi.com
        2          Jones     Mary       mary@hi.com
        3          Johnson   Cathy      cathy@hi.com
      – Row oriented:
        • 1, Smith, Joe, smith@hi.com;
        • 2, Jones, Mary, mary@hi.com;
        • 3, Johnson, Cathy, cathy@hi.com;
      – Column oriented:
        • 1, 2, 3;
        • Smith, Jones, Johnson;
        • Joe, Mary, Cathy;
        • smith@hi.com, mary@hi.com, cathy@hi.com;
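
The two layouts on this slide can be reproduced from the same records. The sketch below is only an in-memory illustration of the serialization order, not how any particular database stores data on disk; the function names are made up for this example.

```python
# The three student records from the slide's table.
records = [
    (1, "Smith", "Joe", "smith@hi.com"),
    (2, "Jones", "Mary", "mary@hi.com"),
    (3, "Johnson", "Cathy", "cathy@hi.com"),
]


def row_layout(rows):
    """Store each record contiguously: all fields of row 1, then row 2, ..."""
    return [",".join(str(field) for field in row) for row in rows]


def column_layout(rows):
    """Store each column contiguously: all ids, then all last names, ..."""
    return [",".join(str(field) for field in column) for column in zip(*rows)]


row_store = row_layout(records)
column_store = column_layout(records)

for line in row_store + column_store:
    print(line)
```

A query that needs only the email column reads one contiguous entry of `column_store`, while the row layout forces it to touch every record; that is why column stores suit OLAP-like scans.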
  31. HBase Schema Example (Student/Course)
      – RDBMS
        • Students: (id, name, sex, age)
        • Courses: (id, title, desc, teacher_id)
        • S_C: (s_id, c_id, type)
      – HBase
        • Students table: row key <student_id>; column family Info (Info:name, Info:sex, Info:age); column family Course (Course:<course_id> = type)
        • Courses table: row key <course_id>; column family Info (Info:title, Info:desc, Info:teacher_id); column family student (student:<student_id> = type)
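
One way to picture the HBase side of this schema is as nested maps: row key, then column family, then qualifier. The sketch below uses plain Python dictionaries, not the HBase API, and every row key and value in it is made-up sample data; `get_cell` and `courses_of` are hypothetical helpers.

```python
# Plain dictionaries standing in for the two HBase tables (not the HBase API).
# All keys and values are invented sample data.
students = {
    "s1": {  # row key: <student_id>
        "Info": {"name": "Joe Smith", "sex": "M", "age": "21"},
        "Course": {"c1": "elective"},  # Course:<course_id> = type
    },
}
courses = {
    "c1": {  # row key: <course_id>
        "Info": {"title": "Databases", "desc": "Intro course", "teacher_id": "t9"},
        "student": {"s1": "elective"},  # student:<student_id> = type
    },
}


def get_cell(table, row_key, family, qualifier):
    """Look up one cell by row key and family:qualifier; None if absent."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)


def courses_of(student_id):
    """Titles of a student's courses, read without a separate S_C join table."""
    enrolled = students.get(student_id, {}).get("Course", {})
    return [get_cell(courses, course_id, "Info", "title") for course_id in enrolled]


print(courses_of("s1"))
```

The enrollment relation is stored twice, once per table, which trades extra writes for lookups that need no join.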
  32. Data Store of NoSQL DB (Cont’d): Document Store
      – Collections and Documents, vs Tables and Records of RDB
      – Used in search engines/repositories
      – Multiple indexes to store indexed documents; no fixed fields
      – Not simple key-value lookup: use API
      – Functions: no locking, replication, transaction
      – MongoDB, CouchDB, ThruDB, SimpleDB
  33. Understanding the Document Model [1]
        {
          _id: "A4304",
          author: "nosh",
          date: 22/6/2010,
          title: "Intro to MongoDB",
          text: "MongoDB is an open source..",
          tags: ["webinar", "opensource"],
          comments: [{author: "mike", date: 11/18/2010, txt: "Did you see the…", votes: 7}, ….]
        }
      – Documents -> Collections -> Databases
  34. Document Model Makes Queries Simple [1]
      – Operators: $gt, $lt, $gte, $lte, $ne, $all, $in, $nin, count, limit, skip, group
      – Example: db.posts.find({author: "nosh", tags: "webinar"})
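
To make the query semantics concrete, here is a small in-memory imitation of `find` in Python. It is a simplified sketch, not MongoDB's actual matching rules or its driver API: it handles equality, membership in array fields, and a handful of the listed operators. The second document and the `views` field are hypothetical sample data.

```python
import operator

# A few of the slide's comparison operators, mapped to Python functions.
OPERATORS = {
    "$gt": operator.gt,
    "$lt": operator.lt,
    "$gte": operator.ge,
    "$lte": operator.le,
    "$ne": operator.ne,
    "$in": lambda value, allowed: value in allowed,
    "$nin": lambda value, banned: value not in banned,
}


def field_matches(value, condition):
    """True if one field value satisfies an equality or operator condition."""
    if isinstance(condition, dict):
        return all(OPERATORS[op](value, arg) for op, arg in condition.items())
    if isinstance(value, list):
        return condition in value  # array field: match if any element equals
    return value == condition


def find(collection, query):
    """Return the documents whose fields satisfy every condition in query."""
    return [
        doc for doc in collection
        if all(key in doc and field_matches(doc[key], cond)
               for key, cond in query.items())
    ]


posts = [
    {"_id": "A4304", "author": "nosh", "title": "Intro to MongoDB",
     "tags": ["webinar", "opensource"], "views": 120},
    {"_id": "B0001", "author": "mike", "title": "Hadoop notes",
     "tags": ["hadoop"], "views": 40},
]

print(find(posts, {"author": "nosh", "tags": "webinar"}))
```

Note that `tags: "webinar"` matches a document whose `tags` array contains that value, so no join against a separate tag table is needed.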
  35. Selected Users [1]
  36. The Great Divide [1]
      – MongoDB sweet spot: Easy, Flexible, Scalable
      – (figure labels: HBase, MongoDB)
  37. Solutions in Big Data Computation
      – Map/Reduce by Google: (Key, Value) parallel computing
      – Apache Hadoop
      – Big Data: Data Computation (MapReduce, Pig)
      – Integrating MapReduce and RDB: Oracle + Hadoop, Sybase IQ, Vertica + Hadoop, Hadoop DB, Greenplum, Aster Data
      – Integrating MapReduce and NoSQL DB: MongoDB MapReduce, HBase
  38. Apache Hadoop
      – Motivated by Google Map/Reduce and GFS
      – Open source project of the Apache Foundation
      – Framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant
      – Two core components:
        • Storage: HDFS, high-bandwidth clustered storage
        • Processing: Map/Reduce, fault-tolerant distributed processing
      – Hadoop scales linearly with data size and analysis complexity
  39. Hadoop issues
      – Map/Reduce is not a DB: it is an algorithm in restricted parallel computing
      – HDFS and HBase cannot compete with the functions of an RDBMS
      – But useful for huge (peta- or terabyte) but uncomplicated data
        • Web crawling
        • Log analysis: log files for web companies
        • New York Times case
  40. MapReduce Pros & Cons Summary
      – Good when:
        • Input, intermediate, and output data are huge
        • Little synchronization is required
        • Read once; batch-oriented datasets (ETL)
      – Bad for:
        • Fast response time
        • Large amounts of shared data
        • Fine-grained synchronization
        • CPU-intensive rather than data-intensive work
        • Continuous input streams
  41. MapReduce in Detail
      – Functions borrowed from functional programming languages (e.g. Lisp)
      – Provides a restricted parallel programming model on Hadoop
        • The user implements Map() and Reduce()
        • The libraries (Hadoop) take care of EVERYTHING else: parallelization, fault tolerance, data distribution, load balancing
  42. Map
      – Converts input data to (key, value) pairs
      – map() functions run in parallel, creating different intermediate (key, value) pairs from different input data sets
  43. Reduce
      – reduce() combines those intermediate values into one or more final values for that same key
      – reduce() functions also run in parallel, each working on a different output key
      – Bottleneck: the reduce phase can’t start until the map phase is completely finished.
  44. Training in Big Data
      – Learn by yourself? You miss many important topics
      – Two main training providers: Cloudera, Hortonworks
        • With hands-on exercises
      – Brief introduction to the Cloudera course materials
        • Especially the MapReduce example
  45. Example: Sort URLs in the largest hit order
      – Compute the most-hit URLs, stored in log files
      – Map()
        • Input: <logFilename, file text>
        • Output: parses the file and emits <url, hit counts> pairs, e.g. <http://hello.com, 1>
      – Reduce()
        • Input: <url, list of hit counts> from multiple map nodes
        • Output: sums all values for the same key and emits <url, TotalCount>, e.g. <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
  46. Map/Reduce for URL visits (diagram)
      – Input log data is split across Map1(), Map2(), …, Mapm()
      – Map outputs: (http://hi.com, 1), (http://hello.com, 3), …; (http://halo.com, 1), (http://hello.com, 5), …
      – Data Aggregation/Combine: (http://hi.com, <1, 1, …, 1>), (http://hello.com, <3, 5, 2, 7>), (http://halo.com, <1, 5>)
      – Reduce1(), Reduce2(), …, Reducel() outputs: (http://hi.com, 32), (http://hello.com, 17), (http://halo.com, 6)
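
The URL-hit example can be simulated on one machine to show the three stages: map, aggregation by key, and reduce. This is a single-process sketch, not Hadoop code. It assumes a hypothetical log format with one visited URL per line, and the file names and contents are invented.

```python
from collections import defaultdict


def map_log(filename, text):
    """Map: parse one log file and emit a (url, 1) pair per hit."""
    for line in text.splitlines():
        url = line.strip()
        if url:
            yield url, 1


def shuffle(pairs):
    """Aggregation: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_hits(url, counts):
    """Reduce: sum the hit counts collected for one URL."""
    return url, sum(counts)


def run_job(files):
    """Run map over every file, group by URL, reduce, and sort by hits."""
    pairs = [pair for name, text in files.items() for pair in map_log(name, text)]
    totals = [reduce_hits(url, counts) for url, counts in shuffle(pairs).items()]
    return sorted(totals, key=lambda item: item[1], reverse=True)


# Invented sample logs: one URL per line.
logs = {
    "log1.txt": "http://hello.com\nhttp://hi.com\nhttp://hello.com\n",
    "log2.txt": "http://hello.com\nhttp://halo.com\n",
}
result = run_job(logs)
print(result)
```

On a cluster each call to `map_log` and `reduce_hits` would run on a different node; here the list comprehension and dictionary grouping stand in for that distribution.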
  47. Legacy Example
      – In late 2007, the New York Times wanted to make its entire archive of articles, 11 million in all, dating back to 1851, available over the web.
      – The archive was a four-terabyte pile of images in TIFF format.
      – The Times needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files.
        • Not a particularly complicated, but large, computing chore requiring a whole lot of computer processing time.
  48. Legacy Example (Cont’d)
      – In late 2007, the New York Times wanted to make its entire archive of articles available over the web.
      – A software programmer at the Times, Derek Gottfrid, was playing around with Amazon Web Services' Elastic Compute Cloud (EC2)
        • He uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3)
        • In less than 24 hours, 11 million PDFs were all stored neatly in S3 and ready to be served up to visitors to the Times site
      – The total cost for the computing job? $240
        • 10 cents per computer-hour times 100 computers times 24 hours
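
The cost figure follows directly from the three numbers on the slide. The sketch below works in integer cents to avoid floating-point rounding; the single-machine comparison is an added illustration that assumes the work parallelizes perfectly.

```python
def job_cost_cents(cents_per_machine_hour, machines, hours):
    """Total pay-as-you-go cost in cents for machines running for hours."""
    return cents_per_machine_hour * machines * hours


machine_hours = 100 * 24                    # 100 computers for 24 hours
total_cents = job_cost_cents(10, 100, 24)   # 10 cents per computer-hour
days_on_one_machine = machine_hours / 24    # same work on a single machine

print(f"${total_cents / 100:.2f} for {machine_hours} machine-hours")
print(f"about {days_on_one_machine:.0f} days on one machine")
```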
  49. Supporters of Big Data: Hadoop Ecosystems
      – Apache Hadoop supporters
        • Cloudera: like Red Hat for Linux; HiPIC is an Academic Partner
        • Hortonworks: Pig; consulting and training
        • Facebook: Hive
        • IBM: Jaql
      – NoSQL DB supporters
        • MongoDB
        • HBase, CouchDB, Apache Cassandra (originally by FB), etc.
  50. Pig
      – Developed at Yahoo Research around 2006; moved into the Apache Software Foundation in 2007
      – PigLatin: Pig's language
        • A data flow language
        • Well suited to processing unstructured data
        • Unlike SQL, does not require that the data have a schema
        • However, can still leverage the value of a schema
  51. Hive
      – Developed at Facebook
        • Turns Hadoop into a data warehouse
        • Complete with a dialect of SQL for querying
      – HiveQL: a declarative language (SQL dialect)
      – Differences from PigLatin:
        • You do not specify the data flow; instead you describe the result you want, and Hive figures out how to build a data flow to achieve it
        • A schema is required, but not limited to one schema: data can have many schemas
  52. Hive (Cont'd)
      – Similarity with PigLatin and SQL:
        • HiveQL on its own is a relationally complete language, but not a Turing-complete language (one that can express any computation)
        • It can be extended through Java UDFs (User Defined Functions) to be Turing complete, just like Pig
  53. Jaql
      – Developed at IBM
      – A data flow language
        • Its native data structure format is JSON (JavaScript Object Notation)
      – Schemas are optional
      – Turing complete on its own, without the need for extension through UDFs
  54. MapReduce Cons and Future
      – Bad for:
        • Fast response time
        • Large amounts of shared data
        • Fine-grained synchronization
        • CPU-intensive rather than data-intensive work
        • Continuous input streams
      – Hadoop 2.0: YARN
        • Not a product yet but will be soon
  55. Hadoop 2.0: YARN
      – Data processing applications and services
        • Online serving: HOYA (HBase on YARN)
        • Real-time event processing: Storm, S4, other commercial platforms
        • Tez: generic framework to run a complex DAG
        • MPI: OpenMPI, MPICH2
        • Master-Worker
        • Machine learning: Spark
        • Graph processing: Giraph
      – Enabled by allowing the use of paradigm-specific application masters [http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
  56. Big Data Supporters
      – Amazon AWS
      – Facebook
      – Twitter
      – Craigslist
  57. Amazon AWS
      – amazon.com: consumer and seller business
      – aws.amazon.com: IT infrastructure business
        • Focus on your business, not IT management
        • Pay as you go
      – Services with many APIs
        • S3: Simple Storage Service
        • EC2: Elastic Compute Cloud, which provides many virtual Linux servers and can run on multiple nodes
        • Hadoop and HBase
        • MongoDB
  58. Amazon AWS (Cont’d)
      – Customers on aws.amazon.com
        • Samsung: Smart TV hub sites; TV applications are on AWS
        • Netflix: ~25% of US internet traffic; ~100% on AWS
        • NASA JPL: analyzes more than 200,000 images
        • NASDAQ: uses AWS S3
      – HiPIC received research and teaching grants from AWS
  59. Facebook [7]
      – Using Apache HBase for Titan and Puma
        • Message services
        • ETL
      – HBase for FB
        • Provides excellent write performance and good reads
        • Nice features: scalable, fault tolerance, MapReduce
  60. Titan: Facebook
      – Message services in FB
        • Hundreds of millions of active users
        • 15+ billion messages a month
        • 50K instant messages a second
      – Challenges
        • High write throughput: every message, instant message, SMS, email
        • Massive clusters: must be easily scalable
      – Solution: clustered HBase
  61. Puma: Facebook
      – ETL: Extract, Transform, Load
        • Data integration from many data sources into the data warehouse
        • Data analytics: domain owners’ web analytics for ads and apps (clicks, likes, shares, comments, etc.)
      – ETL before Puma: 8 to 24 hours
        • Procedures: Scribe, HDFS, Hive, MySQL
      – ETL after Puma: 2 to 30 secs
        • Puma: real-time MapReduce framework
        • Procedures: Scribe, HDFS, Puma, HBase
  62. Twitter [8]: Three Challenges
      – Collecting data: Scribe, as at Facebook
      – Large-scale storage and analysis
        • Cassandra: ColumnFamily key-value store
        • Hadoop
      – Rapid learning over Big Data: Pig
        • 5% of the Java code
        • 5% of the dev time
        • Within 20% of the running time
  63. Craigslist in MongoDB [9]
      – Craigslist
        • ~700 cities, worldwide
        • ~1 billion hits/day
        • ~1.5 million posts/day
        • Servers: ~500 servers, ~100 MySQL servers
      – Migrate to MongoDB: Scalable, Fast, Proven, Friendly
  64. Hadoop Streaming
      – Hadoop MapReduce for non-Java code: Python, Ruby
      – Requirements
        • A running Hadoop
        • The Hadoop Streaming API: hadoop-streaming.jar
        • Mapper and Reducer code, a simple conversion from sequential code
      – STDIN > mapper > reducer > STDOUT
  65. Hadoop Streaming
      – MapReduce Python execution: http://wiki.apache.org/hadoop/HadoopStreaming
      – Syntax:
          $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
      – Options:
          -input <path>                   DFS input file(s) for the Map step
          -output <path>                  DFS output directory for the Reduce step
          -mapper <cmd|JavaClassName>     The streaming command to run
          -reducer <cmd|JavaClassName>    The streaming command to run
          -file <file>                    File/dir to be shipped in the Job jar file
      – Example:
          $ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
              -file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
              -file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
              -input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
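
The slides do not show the contents of mapper.py and reducer.py. As a hedged illustration, a word count over the Shakespeare input could look like the sketch below. It keeps both steps in one file as generator functions and adds a hypothetical `simulate` helper that imitates the STDIN > mapper > sort > reducer > STDOUT pipeline locally; in a real streaming job each function would live in its own script, reading `sys.stdin` and printing its output lines.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; the reducer's input arrives sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def simulate(lines):
    """Local stand-in for the pipeline: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


for output_line in simulate(["to be or not to be"]):
    print(output_line)
```

Because the reducer only ever compares adjacent lines, it relies on the sort between the two steps; running it on unsorted mapper output would produce several partial counts for the same word.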
  66. Conclusion
      – Era of Big Data
      – Need to store and compute Big Data
        • Many solutions, but chiefly Hadoop
        • Storage: NoSQL DB
        • Computation: Hadoop MapReduce
      – Need to analyze Big Data in mobile computing and SNS for ads, user behavior, patterns, …
  67. Questions?