
Introduction To Big Data and Use Cases using Hadoop


Introduction to Hadoop and YARN, with use cases in chip and semiconductor design

Published in: Engineering


  1. Introduction To Big Data and Use Cases using Hadoop
     Jongwook Woo (PhD), High-Performance Information Computing Center (HiPIC), California State University Los Angeles
     Cloudera Academic Partner and Grants Awardee of Amazon AWS
     ENC Lab, Hanyang University, Seoul, Korea, Aug 19th 2014
  2. Contents (High Performance Information Computing Center, Jongwook Woo, CSULA)
     • Introduction
     • Emerging Big Data Technology
     • Big Data Use Cases
     • Hadoop 2.0
     • Training in Big Data
  3. Me
     • Name: Jongwook Woo
     • Occupation: Professor (rank: Associate Professor), California State University Los Angeles, the capital city of entertainment
     • Career:
       – Professor since 2002: Computer Information Systems Dept, College of Business and Economics, www.calstatela.edu/faculty/jwoo5
       – Consulting for many companies, in Hollywood and elsewhere, since 1998
         • Mainly building eBusiness applications with J2EE middleware
         • Information extraction and integration with the FAST, Lucene/Solr, and Sphinx search engines
         • Warner Bros (Matrix online game), E!, citysearch.com, ARM, and others
     • Interested in Hadoop and Big Data since around 2009
  4. Experience in Big Data
     • Grants
       – Microsoft Windows Azure Educator Grant (Oct 2013 - July 2014)
       – Amazon AWS in Education Research Grant (July 2012 - July 2014)
       – Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
     • Partnership
       – Academic Education Partnership with Cloudera since June 2012
       – Linked with Hortonworks since May 2013; positive about providing partnership
  5. Experience in Big Data
     • Certificates
       – Certified Cloudera Instructor
       – Certified Cloudera Hadoop Developer / Administrator
       – Certificate of Achievement in the Big Data University training course “Hadoop Fundamentals I”, July 8 2012
       – Certificate of the 10gen training course “M101: MongoDB Development” (Dec 24 2012)
     • Blog and GitHub for Hadoop and its ecosystems
       – http://dal-cloudcomputing.blogspot.com/ (Hadoop, AWS, Cloudera)
       – https://github.com/hipic (Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop)
       – https://github.com/dalgual
  6. Experience in Big Data
     • Several publications regarding Hadoop and NoSQL
       – Deeksha Lakshmi, Iksuk Kim, Jongwook Woo, “Analysis of MovieLens Data Set using Hive”, Journal of Science and Technology, Dec 2013, Vol 3 No 12, pp 1194-1198, ARPN
       – Chul Sung, Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck Choe, “Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data”, Proceedings of the International Joint Conference on Neural Networks, 2013
       – Jongwook Woo, “Apriori-Map/Reduce Algorithm”, PDPTA 2012, Las Vegas (July 16-19, 2012)
       – Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, EDB 2012, Incheon, Aug. 25-27, 2011
       – Jongwook Woo and Yuhang Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, PDPTA 2011, Las Vegas (July 18-21, 2011)
     • Collaboration with universities and companies
       – USC, Texas A&M, Cloudera, Amazon, Microsoft
  7. What is Big Data? Map/Reduce, Hadoop, and NoSQL DB on Cloud Computing
  8. Data
     • Google: “We don’t have a better algorithm than others, but we have more data than others.”
  9. New Data Trend
     • Sparsity
     • Unstructured
       – Schema-free data with sparse attributes (semantic or social relations)
       – No relational properties, no complex join queries (e.g., log data)
     • Immutable
       – No need to update or delete data
  10. Data Issues
     • Large-scale data: terabytes (10^12), petabytes (10^15)
       – Because of the web: sensor data, bioinformatics, social computing, smart phones, online games, …
     • Cannot be handled with the legacy approach
       – Too big; un-/semi-structured data; too expensive
     • Need new, non-expensive systems
  11. Two Cores in Big Data
     • How to store Big Data
     • How to compute Big Data
     • Google
       – Stores Big Data in GFS, on inexpensive commodity computers
       – Computes Big Data with MapReduce: parallel computing on multiple non-expensive computers, rather than owning supercomputers
  12. Hadoop 1.0
     • Hadoop
       – Doug Cutting: the creator of Hadoop
       – Founder of the Apache Lucene, Nutch, Avro, and Hadoop projects
       – Board member of the Apache Software Foundation
       – Chief Architect at Cloudera
     • MapReduce
     • HDFS
     • Restricted parallel programming
       – Not for iterative algorithms
       – Not for graphs
  13. MapReduce in Detail
     • Functions borrowed from functional programming languages (e.g., Lisp)
     • Provides a restricted parallel programming model on Hadoop
       – The user implements Map() and Reduce()
       – Libraries (Hadoop) take care of EVERYTHING else: parallelization, fault tolerance, data distribution, load balancing
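The division of labor described on this slide can be made concrete with a toy, single-process sketch. This is plain Python, not the actual Hadoop API: the `run_mapreduce` driver and the word-count functions are illustrative assumptions, standing in for the framework and the user code respectively.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy driver standing in for the framework: it handles the
    'everything else' -- grouping intermediate pairs by key."""
    intermediate = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):      # map phase
            intermediate[key].append(value)
    # the shuffle/sort step is simulated by the dict grouping above
    return {key: reduce_fn(key, values)        # reduce phase
            for key, values in intermediate.items()}

# User-supplied functions: classic word count.
def map_fn(line):
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    return sum(counts)

result = run_mapreduce(["big data big hadoop", "hadoop big"],
                       map_fn, reduce_fn)
# result["big"] == 3
```

The user writes only `map_fn` and `reduce_fn`; everything between the two phases is the framework's job, which is exactly the restriction the slide describes.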
  14. Map
     • Converts input data to (key, value) pairs
     • map() functions run in parallel, creating different intermediate (key, value) values from different input data sets
  15. Reduce
     • reduce() combines the intermediate values into one or more final values for the same key
     • reduce() functions also run in parallel, each working on a different output key
     • Bottleneck: the reduce phase can’t start until the map phase is completely finished
  16. Example: Sort URLs in Largest-Hit Order
     • Compute the most-hit URLs, stored in log files
     • Map()
       – Input: <logFilename, file text>
       – Output: parses the file and emits <url, hit count> pairs, e.g. <http://hello.com, 1>
     • Reduce()
       – Input: <url, list of hit counts> from multiple map nodes
       – Output: sums all values for the same key and emits <url, TotalCount>, e.g. <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
  17. Map/Reduce for URL Visits (diagram)
     • Input log data is split across Map1() … Mapm(), which emit pairs such as (http://hi.com, 1), (http://hello.com, 3), (http://halo.com, 1), (http://hello.com, 5), …
     • Data aggregation/combine groups intermediate values by key: (http://hi.com, <1, 1, …, 1>), (http://hello.com, <3, 5, 2, 7>), (http://halo.com, <1, 5>)
     • Reduce1() … Reducel() sum each group: (http://hi.com, 32), (http://hello.com, 17), (http://halo.com, 6)
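The URL hit-count flow in the two slides above can be traced end to end in a small sketch. This is a plain Python simulation, not Hadoop; the log lines are made-up examples.

```python
from collections import defaultdict

# Map phase: each "log record" is parsed and emitted as a <url, 1> pair
log_lines = [
    "http://hello.com", "http://hi.com", "http://hello.com",
    "http://halo.com", "http://hello.com", "http://hi.com",
]
pairs = [(url, 1) for url in log_lines]

# Aggregation/combine: group intermediate values by key,
# producing e.g. ("http://hello.com", [1, 1, 1])
grouped = defaultdict(list)
for url, count in pairs:
    grouped[url].append(count)

# Reduce phase: sum the hit counts for each URL, then rank by total
totals = {url: sum(counts) for url, counts in grouped.items()}
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
# ranked[0] is ("http://hello.com", 3): the most-visited URL
```

In real Hadoop the grouping step is the shuffle, done by the framework across machines; only the first and last steps correspond to user code.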
  18. Hadoop 1 Architecture [1]
     • JobTracker: manages cluster resources & job scheduling
     • TaskTracker: per-node agent; manages tasks
  19. MapReduce 1.0 Cons and Future
     • Bad for:
       – Fast response times
       – Large amounts of shared data
       – Fine-grained synchronization needs
       – CPU-intensive (rather than data-intensive) work
       – Continuous input streams
     • Hadoop 2.0: the YARN product
  20. Hadoop 1 Limitations [1]
     • Lacks support for alternate paradigms and services
       – Forces everything to look like MapReduce
       – Iterative applications in MapReduce are 10x slower
     • Scalability
       – Max cluster size ~5,000 nodes; max concurrent tasks ~40,000
     • Availability
       – Failure kills queued & running jobs
     • Non-optimal resource utilization
       – Hard partition of resources into map and reduce slots
  21. Hadoop as a Next-Gen Platform [1]
     • HADOOP 1.0: a single-use system (batch apps)
       – HDFS (redundant, reliable storage)
       – MapReduce (cluster resource management & data processing)
     • HADOOP 2.0: a multi-purpose platform (batch, interactive, online, streaming, …)
       – HDFS2 (redundant, highly-available & reliable storage)
       – YARN (cluster resource management)
       – MapReduce and others (data processing)
  22. Hadoop 2 - YARN Architecture [1] (diagram)
     • ResourceManager (RM): global; receives job submissions from clients and node status from NodeManagers
     • NodeManager (NM): per-node agent; hosts containers
     • ApplicationMaster (AM): per-application; manages the application lifecycle and task scheduling, sends resource requests to the RM, and reports MapReduce status
  23. Hadoop 2.0: YARN [2]
     • Data processing applications and services
       – Online serving: HOYA (HBase on YARN)
       – Real-time event processing: Storm, S4, other commercial platforms
       – Tez: generic framework to run a complex DAG
       – MPI: OpenMPI, MPICH2
       – Master-Worker
       – Machine learning: Spark
       – Graph processing: Giraph
       – Impala: interactive SQL
     • Enabled by allowing the use of paradigm-specific application masters
  24. Josh Wills (Cloudera)
     • “I have found that many kinds of scientists – such as astronomers, geneticists, and geophysicists – are working with very large data sets in order to build models that do not involve statistics or machine learning, and that these scientists encounter data challenges that would be familiar to data scientists at Facebook, Twitter, and LinkedIn.”
     • “Data science is a set of techniques used by many scientists to solve problems across a wide array of scientific fields.”
  25. Legacy Example
     • In late 2007, the New York Times wanted to make its entire archive of articles, 11 million in all, dating back to 1851, available over the web
       – A four-terabyte pile of images in TIFF format
       – It needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files
       – Not a particularly complicated computing chore, but a large one, requiring a whole lot of computer processing time
  26. Legacy Example (Cont’d)
     • Derek Gottfrid, a software programmer at the Times, had been playing around with Amazon Web Services’ Elastic Compute Cloud (EC2)
       – He uploaded the four terabytes of TIFF data into Amazon’s Simple Storage Service (S3)
       – In less than 24 hours: 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site
     • The total cost for the computing job? $240
       – 10 cents per computer-hour × 100 computers × 24 hours
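The $240 figure is just the slide's own arithmetic; working in cents keeps the result exact and avoids floating-point rounding:

```python
# Figures as stated on the slide
rate_cents_per_hour = 10   # 10 cents per computer-hour
computers = 100
hours = 24

total_cents = rate_cents_per_hour * computers * hours
total_dollars = total_cents / 100
# total_dollars == 240.0
```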
  27. HuffPost | AOL: Two Machine Learning Use Cases
     • Comment moderation
       – Evaluate all new HuffPost user comments every day
       – Identify abusive / aggressive comments
       – Auto delete / publish ~25% of comments every day
     • Article classification
       – Tag articles for advertising, e.g.: scary, salacious, …
  28. Use Cases Experienced
     • Log analysis
       – Log files from IPS and IDS: 1.5 GB per day for each system
       – Extracting unusual cases using Hadoop, Solr, and Flume on Cloudera
     • Customer behavior analysis
     • Market Basket Analysis algorithm
     • Machine learning for image processing, with Texas A&M
       – Hadoop Streaming API
     • Movie data analysis
       – Hive, Impala
  29. Use Cases: Chip Design and Semiconductor
     • Intel
       – Uses Hadoop to gather historical information during manufacturing and to combine it with new sources of information that had previously been too unmanageable to use
       – A small team of five people was able to slash $3 million off the cost of testing just one line of Intel Core processors in 2012; Intel IT expects to realize an additional $30 million in cost avoidance in 2013-2014
  30. Use Cases: Chip Design and Semiconductor
     • AMD
       – Using a Hadoop implementation, cut the work of employees checking semiconductor wafer quality data by 90 percent, by catching faulty product batches earlier in production
     • Samsung
       – Use of a Hadoop file system for its data warehouse made analytics processing 10 times faster on 75 percent as much computing power, even as data sets grew 10 times larger
  31. How to Adopt Hadoop and Its Ecosystems
     • Needs R&D by universities
       – A university should own a Hadoop cluster: it is an inexpensive supercomputer
       – Makes possible Big Data R&D, training, and RFPs
     • We may predict many things
       – Semiconductor data analysis
       – System chip design data analysis
  32. How to Adopt Hadoop and Its Ecosystems
     • Training program
       – Self-study
         • Takes time: more than a year to become an expert
         • You don’t learn the details and miss many important topics
       – Cloudera
         • $2,000, hands-on exercises
         • Covers Hadoop, HBase, Hive/Pig, data analysis, Spark, data mining, etc.
         • Courses: Hadoop Developer, Hadoop System Administrator, Hadoop Data Analyst/Scientist, Hadoop Spark
  33. How to Adopt Hadoop and Its Ecosystems
     • Training program (Cont’d)
       – Educational partnership with Cloudera
         • One of the initiators of the Cloudera Academic Programs; teaches Big Data at CSULA
         • http://www.cloudera.com/content/cloudera/en/our-customers/csula-academic-partnership.html
         • Training people at Samsung and other small companies in Korea using Cloudera’s material
  34. Conclusion
     • The era of Big Data: we need to store and compute Big Data
     • Among the many solutions, Hadoop stands out
     • Hadoop is a supercomputer that you can own
     • Hadoop 2.0
     • A blue ocean for publications and RFPs
     • Training is important
  35. Questions?
  36. References
     1. YARN: Apache Hadoop Next Generation Compute Platform, Bikas Saha
     2. Apache Hadoop YARN: Enabling Next Generation, http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex
