Transcript

  • 1. Big Data and Data Intensive Computing: Education and Training. Graduate School of Communication & Art, Yonsei University, Shinchon, Korea, Sept 5th 2013. Jongwook Woo (PhD), High-Performance Information Computing Center (HiPIC), Educational Partner with Cloudera and Grants Awardee of Amazon AWS, Computer Information Systems Department, California State University, Los Angeles
  • 2. High Performance Information Computing Center Jongwook Woo CSULA Contents: Introduction; Big Data Use Cases; Data Issues; Big Data; Data-Intensive Computing: Hadoop; Training in Big Data; Big Data Supporters
  • 3. Me: Name: Jongwook Woo. Occupation: Professor (rank: Associate Professor), California State University Los Angeles – Capital City of Entertainment. Career: Professor since 2002: Computer Information Systems Dept, College of Business and Economics – www.calstatela.edu/faculty/jwoo5. Consulting for many companies in Hollywood and elsewhere since 1998 – mainly building eBusiness applications with J2EE middleware – information extraction and integration using the FAST, Lucene/Solr, and Sphinx search engines – Warner Bros (Matrix online game), E!, citysearch.com, ARM, etc. Interested in Hadoop and Big Data since around 2009
  • 4. Me: Career (continued): Advising IglooSecurity as of summer 2013: – Training on Hadoop and its ecosystem – R&D on a system for fast search over security log files generated at 30GB to 100GB per day • Using Hadoop, Solr, Java, Cloudera. Mid-September 2013: Samsung Advanced Institute of Technology – 3-day training on Hadoop and its ecosystem planned – Using Cloudera material in Korea as far as I know
  • 5. Experience in Big Data: Grants: Received Amazon AWS in Education Research Grant (July 2012 - July 2014); Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011). Partnership: Academic Education Partnership with Cloudera since June 2012; Linked with Hortonworks since May 2013 – Hortonworks is positive about providing a partnership
  • 6. Experience in Big Data: Certificates: Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I” (July 8 2012); Certificate of 10gen Training Course, “M101: MongoDB Development” (Dec 24 2012). Blog and GitHub for Hadoop and its ecosystem: http://dal-cloudcomputing.blogspot.com/ – Hadoop, AWS, Cloudera; https://github.com/hipic – Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop; https://github.com/dalgual
  • 7. Experience in Big Data: Several publications regarding Hadoop and NoSQL: “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012); “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2012, Incheon, Aug. 25-27, 2011; “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011); Jongwook Woo, “Introduction to Cloud Computing”, in the 10th KOCSEA Technical Symposium, UNLV, Dec 18 - 19, 2009. Talks at Korean universities and companies: Yonsei, Sookmyung, KAIST, Korean Polytech Univ – Winter 2011; VanillaBreeze – Winter 2011
  • 8. What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
  • 9. Data: Google: “We don’t have a better algorithm than others but we have more data than others”
  • 10. Use Cases in Korea: SK Telecom; Seoul; Credit Cards; Hyundai Motors
  • 11. SK Telecom T Map: Collect GPS traffic data from taxis, buses, and rental cars – traffic data from 50,000 cars every 5 minutes; Give the quickest directions to the destination
  • 12. Seoul Night Bus: Collect GPS traffic data from taxis; Find the most frequently traveled routes – build bus lines for the night
  • 13. Credit Cards: Apps to find popular restaurants: collect customer behavior from card usage at restaurants; logic based on how often customers visit the same restaurant within 3 months; show the popular restaurants. Credit cards for gas station discounts: when a customer uses a card at a gas station that does not provide discounts, sell a new card that gives a discount at any station
  • 14. Hyundai Motors: Improve the present and future models; Collect drivers’ behavior and the status of the cars; Collect any errors in the car
  • 15. Use Cases: Presidential Election; Amazon AWS; HuffPost | AOL
  • 16. Presidential Election: People Behavior Analysis: Collect people’s data on credit card usage, car models, newspapers they read, Facebook, Twitter. For example, a pro-environmental campaign targeted at – a mom • who sends her kids to public school, • who tweets about organic foods
  • 17. HuffPost | AOL [10]: Two Machine Learning Use Cases. Comment Moderation – evaluate all new HuffPost user comments every day • identify abusive / aggressive comments • auto delete / publish ~25% of comments every day. Article Classification – tag articles for advertising • e.g., scary, salacious, …
  • 18. HuffPost | AOL [10]: Parallelize on Hadoop. Good news: – Mahout, a parallel machine learning tool, is already available. – Mallet, libsvm, Weka, … support the necessary algorithms. Bad news: – Mahout doesn’t support the necessary algorithms yet. – The other tools do not run natively on Hadoop. Build a flexible ML platform running on Hadoop; Pig for the Hadoop implementation.
  • 19. Others: amazon.com: recommend books to people. Google: detect influenza outbreaks much earlier – by analyzing the areas affected by influenza; Translator – by analyzing data from many people. Siri of Apple: natural language processing based on data from many people
  • 20. New Data Trend: Sparsity. Unstructured: schema-free data with sparse attributes – semantic or social relations. No relational property – nor complex join queries • log data. Immutable: no need to update and delete data
  • 21. Data Issues: Large-scale data: terabytes (10^12 bytes), petabytes (10^15 bytes) – because of the web – sensor data, bioinformatics, social computing, smart phones, online games… Cannot be handled with the legacy approach: too big; un-/semi-structured data; too expensive. Need new systems: inexpensive
  • 22. Two Cores in Big Data: How to store Big Data: NoSQL DB. How to compute Big Data: parallel computing with multiple inexpensive computers – own your own supercomputer
  • 23. Big Data for RDBMS: Issues in RDBMS: Hard to scale – relations get broken • partitioning for scalability • replication for availability. Speed – the seek times of physical storage • slower than network speed • 1TB disk at a 10MB/s transfer rate – 100K sec => 27.8 hrs – with multiple data sources at different places • 100 disks of 10GB each, every disk at a 10MB/s transfer rate – 1K sec => 16.7 min
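The read-time figures on this slide can be checked with a short calculation. This is a minimal sketch that assumes decimal units (1TB = 10^12 bytes) and a rate of 10 megabytes per second, which is the rate that reproduces the slide's 100K-second and 1K-second results; the function name `transfer_seconds` is made up for the illustration.

```python
TB = 10**12  # bytes, decimal units (assumption)
MB = 10**6   # bytes


def transfer_seconds(total_bytes, bytes_per_second, disks=1):
    """Seconds to read total_bytes split evenly over `disks` drives read in parallel."""
    return total_bytes / disks / bytes_per_second


# One 1TB disk read sequentially at 10MB/s.
single = transfer_seconds(1 * TB, 10 * MB)

# The same 1TB spread over 100 disks (10GB each), all read at the same time.
parallel = transfer_seconds(1 * TB, 10 * MB, disks=100)

print(f"single disk: {single:.0f} s = {single / 3600:.1f} h")
print(f"100 disks:   {parallel:.0f} s = {parallel / 60:.1f} min")
```

The hundredfold speed-up comes only from reading the disks concurrently, which is the motivation for distributing data across many inexpensive machines.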
  • 24. Big Data for RDBMS (Cont’d): Issues in RDBMS (Cont’d): Data integration – not good for un-/semi-structured data • much of the data is unstructured – web or log data, etc. RDB is not good at parallelization – cannot efficiently split 1000 tasks across 1000 inexpensive PCs
  • 25. RDBMS Issues Solution: Before: Data Warehouse. Now and future: Big Data: Hadoop framework; Data Computation (MapReduce, Pig); Data Repositories (NoSQL DB: HBase, Cassandra, MongoDB); Business Intelligence (Data Mining, OLAP, Data Visualization, Reporting): Hive, Mahout
  • 26. Big Data Definition: Systems that support an inexpensive platform to store and compute large-scale, un-/semi-structured data
  • 27. Use Cases for NoSQL DB [1]: RDBMS replacement for high-traffic web applications; Semi-structured content management; Real-time analytics & high-speed logging; Web infrastructure: Web 2.0, Media, SaaS, Gaming, Finance, Telecom, Healthcare, Government. Three NoSQL DB approaches: Key/Value, Column, Document
  • 28. Data Store of NoSQL DB: Key/Value store: (Key, Value). Functions – index, versioning, sorting, locking, transaction, replication. Examples: Apache Cassandra, Memcached
  • 29. Data Store of NoSQL DB (Cont’d): Column-Oriented Stores (Extensible Record Stores): store data tables as sections of columns of data – rather than as rows of data, like most RDBMS • sparse fields in RDBMS – well-suited for OLAP-like workloads (e.g., data warehouses). Extensible records are horizontally and vertically partitioned across nodes – rows and columns are distributed over multiple nodes. Examples: BigTable, HBase, Cassandra, Hypertable
  • 30. Data Store of NoSQL DB (Cont’d): Source table with columns (StudentId, Lastname, Firstname, email): (1, Smith, Joe, smith@hi.com); (2, Jones, Mary, mary@hi.com); (3, Johnson, Cathy, cathy@hi.com). Row Oriented – 1,Smith,Joe,smith@hi.com; – 2,Jones,Mary,mary@hi.com; – 3,Johnson,Cathy,cathy@hi.com; Column Oriented – 1,2,3; – Smith,Jones,Johnson; – Joe,Mary,Cathy; – smith@hi.com,mary@hi.com,cathy@hi.com;
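The two layouts on this slide can be reproduced in a few lines. This is an illustrative sketch only: it uses the slide's three student records and plain Python lists, and the variable names are made up for the example. It does not model how any particular database lays out data on disk.

```python
# The slide's table: (StudentId, Lastname, Firstname, email)
rows = [
    (1, "Smith", "Joe", "smith@hi.com"),
    (2, "Jones", "Mary", "mary@hi.com"),
    (3, "Johnson", "Cathy", "cathy@hi.com"),
]

# Row-oriented layout: all fields of one record are stored together.
row_store = [",".join(str(field) for field in record) for record in rows]

# Column-oriented layout: all values of one attribute are stored together.
column_store = [list(column) for column in zip(*rows)]

# Reading a single attribute touches one list in the column layout,
# while the row layout would have to scan every record.
last_names = column_store[1]

print(row_store)
print(column_store)
```

This is why column stores suit OLAP-like workloads that aggregate a few attributes over many records.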
  • 31. HBase Schema Example (Student/Course): RDBMS: Students: (id, name, sex, age); Courses: (id, title, desc, teacher_id); S_C: (s_id, c_id, type). HBase: Table with row id <student_id> and column families Info and Course: Info:name, Info:sex, Info:age, Course:<course_id> = type. Table with row id <course_id> and column families Info and student: Info:title, Info:desc, Info:teacher_id, student:<student_id> = type
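One way to picture this schema is as nested dictionaries keyed by row id, then column family, then column qualifier. The sketch below is an in-memory model only, not HBase client code; the sample ids and values ("s1", "c1", "elective", and so on) and the helper names are hypothetical.

```python
# Table keyed by <student_id>: families Info and Course.
students = {
    "s1": {
        "Info": {"name": "Joe", "sex": "M", "age": "20"},
        "Course": {"c1": "elective", "c2": "required"},  # Course:<course_id> = type
    },
    "s2": {
        "Info": {"name": "Mary", "sex": "F", "age": "21"},
        "Course": {"c1": "required"},
    },
}

# Table keyed by <course_id>: families Info and student.
courses = {
    "c1": {
        "Info": {"title": "Databases", "desc": "Intro course", "teacher_id": "t1"},
        "student": {"s1": "elective", "s2": "required"},  # student:<student_id> = type
    },
    "c2": {
        "Info": {"title": "Networks", "desc": "Intro course", "teacher_id": "t2"},
        "student": {"s1": "required"},
    },
}


def courses_of(student_id):
    """Course ids for one student: a single row lookup, no join table needed."""
    return sorted(students[student_id]["Course"])


def students_of(course_id):
    """Student ids for one course, read from the course row."""
    return sorted(courses[course_id]["student"])


print(courses_of("s1"), students_of("c1"))
```

The many-to-many S_C relation is stored twice, once in each table, so either direction is answered by reading one row.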
  • 32. Data Store of NoSQL DB (Cont’d): Document Store: Collections and Documents – vs Tables and Records of RDB. Used in search engines/repositories. Multiple indexes to store indexed documents – no fixed fields. Not a simple key-value lookup – use the API. Functions – no locking; replication; transaction. Examples: MongoDB, CouchDB, ThruDB, SimpleDB
  • 33. Understanding the Document Model [1]: { _id: “A4304”, author: “nosh”, date: 22/6/2010, title: “Intro to MongoDB”, text: “MongoDB is an open source..”, tags: [“webinar”, “opensource”], comments: [{author: “mike”, date: 11/18/2010, txt: “Did you see the…”, votes: 7},….] } Documents -> Collections -> Databases
  • 34. Document Model Makes Queries Simple [1]: Operators: $gt, $lt, $gte, $lte, $ne, $all, $in, $nin, count, limit, skip, group. Example: db.posts.find({author: “nosh”, tags: “webinar”})
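The example query matches documents whose author equals "nosh" and whose tags array contains "webinar". The sketch below imitates just that equality-and-membership behavior over plain Python dictionaries; it is not MongoDB or driver code, the `matches` and `find` helpers are hypothetical, and none of the listed operators are handled.

```python
def matches(doc, query):
    """True if every query field equals the document's value,
    or is contained in it when the document's value is a list."""
    for field, wanted in query.items():
        actual = doc.get(field)
        if isinstance(actual, list):
            if wanted != actual and wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


def find(collection, query):
    """Return the documents in `collection` that satisfy `query`."""
    return [doc for doc in collection if matches(doc, query)]


# Hypothetical collection modeled on the slide's sample document.
posts = [
    {"_id": "A4304", "author": "nosh", "title": "Intro to MongoDB",
     "tags": ["webinar", "opensource"]},
    {"_id": "A4305", "author": "nosh", "title": "Another post",
     "tags": ["opensource"]},
    {"_id": "A4306", "author": "mike", "title": "A reply",
     "tags": ["webinar"]},
]

hits = find(posts, {"author": "nosh", "tags": "webinar"})
print([doc["_id"] for doc in hits])
```

No join or schema declaration is needed because the tags live inside each document.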
  • 35. Selected Users [1]
  • 36. The Great Divide [1]: MongoDB sweet spot: Easy, Flexible, Scalable (figure comparing HBase and MongoDB)
  • 37. Solutions in Big Data Computation: Map/Reduce by Google: (Key, Value) parallel computing. Apache Hadoop. Big Data: Data Computation (MapReduce, Pig). Integrating MapReduce and RDB: Oracle + Hadoop, Sybase IQ, Vertica + Hadoop, Hadoop DB, Greenplum, Aster Data. Integrating MapReduce and NoSQL DB: MongoDB MapReduce, HBase
  • 38. Apache Hadoop: Motivated by Google Map/Reduce and GFS; an open source project of the Apache Foundation; a framework written in Java – originally developed by Doug Cutting • who named it after his son's toy elephant. Two core components: Storage: HDFS – high-bandwidth clustered storage; Processing: Map/Reduce – fault-tolerant distributed processing. Hadoop scales linearly with data size and analysis complexity
  • 39. Hadoop issues: Map/Reduce is not a DB: algorithms in restricted parallel computing. HDFS and HBase cannot compete with the functions in RDBMS. But useful for huge (peta- or terabytes) but uncomplicated data – web crawling – log analysis • log files for web companies – New York Times case
  • 40. MapReduce Pros & Cons Summary: Good when: huge data for input, intermediate, and output; little synchronization required; read once, batch-oriented datasets (ETL). Bad for: fast response time; large amounts of shared data; fine-grained synchronization needed; CPU-intensive rather than data-intensive work; continuous input streams
  • 41. MapReduce in Detail: Functions borrowed from functional programming languages (e.g., Lisp). Provides a restricted parallel programming model on Hadoop. User implements Map() and Reduce(). Libraries (Hadoop) take care of EVERYTHING else – parallelization – fault tolerance – data distribution – load balancing
  • 42. Map: Converts input data to (key, value) pairs. map() functions run in parallel, creating different intermediate (key, value) pairs from different input data sets
  • 43. Reduce: reduce() combines those intermediate values into one or more final values for that same key. reduce() functions also run in parallel, each working on a different output key. Bottleneck: the reduce phase can’t start until the map phase is completely finished.
  • 44. Training in Big Data: Learn by yourself? You may miss many important topics. Two main training providers: – Cloudera, Hortonworks • with hands-on exercises. Brief introduction to the Cloudera course material, especially the MapReduce example
  • 45. Example: Sort URLs in the largest hit order: Compute the most-hit URLs stored in log files. Map(): Input: <logFilename, file text>; Output: parses the file and emits <url, hit counts> pairs – e.g., <http://hello.com, 1>. Reduce(): Input: <url, list of hit counts> from multiple map nodes; Output: sums all values for the same key and emits <url, TotalCount> – e.g., <http://hello.com, (3, 5, 2, 7)> => <http://hello.com, 17>
  • 46. Map/Reduce for URL visits: Input log data is split across Map1(), Map2(), …, Mapm(), which emit pairs such as (http://hi.com, 1), (http://hello.com, 3), … and (http://halo.com, 1), (http://hello.com, 5), …; Data Aggregation/Combine groups them by key into (http://hi.com, <1, 1, …, 1>), (http://hello.com, <3, 5, 2, 7>), (http://halo.com, <1, 5>); Reduce1(), Reduce2(), …, Reducel() output (http://hi.com, 32), (http://hello.com, 17), (http://halo.com, 6)
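The map, aggregate, and reduce steps of the URL example can be imitated on one machine in plain Python. This is a sketch of the data flow only, not Hadoop code: it assumes a made-up log format with one URL per line, and the names `map_log`, `shuffle`, and `reduce_hits` are hypothetical.

```python
from collections import defaultdict


def map_log(filename, text):
    """Map: parse one log file and emit a (url, 1) pair per hit."""
    for line in text.splitlines():
        url = line.strip()
        if url:
            yield url, 1


def shuffle(pairs):
    """Aggregation step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_hits(url, counts):
    """Reduce: sum all counts for one URL."""
    return url, sum(counts)


# Hypothetical input: two small log files, one URL per line.
logs = {
    "a.log": "http://hello.com\nhttp://hi.com\nhttp://hello.com\n",
    "b.log": "http://hello.com\nhttp://halo.com\n",
}

pairs = [pair for name, text in logs.items() for pair in map_log(name, text)]
totals = [reduce_hits(url, counts) for url, counts in shuffle(pairs).items()]

# Final step from the slide title: order URLs by total hits, largest first.
ranked = sorted(totals, key=lambda item: item[1], reverse=True)
print(ranked)
```

On a real cluster each call to the map function would run on a different node, and the framework, not user code, would perform the grouping step.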
  • 47. Legacy Example: In late 2007, the New York Times wanted to make its entire archive of articles available over the web: 11 million in all, dating back to 1851, stored as a four-terabyte pile of images in TIFF format. It needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files – not a particularly complicated but a large computing chore, • requiring a whole lot of computer processing time.
  • 48. Legacy Example (Cont’d): A software programmer at the Times, Derek Gottfrid, – playing around with Amazon Web Services and its Elastic Compute Cloud (EC2), • uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3) • In less than 24 hours, 11 million PDFs were all stored neatly in S3 and ready to be served up to visitors to the Times site. The total cost for the computing job? $240 – 10 cents per computer-hour times 100 computers times 24 hours
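The $240 figure follows directly from the three numbers on the slide. A small check, using integer cents to avoid floating-point rounding (the variable names are made up for the illustration):

```python
cents_per_computer_hour = 10  # 10 cents per computer-hour
computers = 100
hours = 24

total_cents = cents_per_computer_hour * computers * hours
total_dollars = total_cents // 100

print(f"${total_dollars}")
```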
  • 49. Supporters of Big Data: Hadoop Ecosystems: Apache Hadoop supporters: Cloudera – like Linux and Red Hat – HiPIC is an Academic Partner; Hortonworks – Pig – consulting and training; Facebook – Hive; IBM – Jaql. NoSQL DB supporters: MongoDB; HBase, CouchDB, Apache Cassandra (originally by FB), etc.
  • 50. Pig: • developed at Yahoo Research around 2006 – moved into the Apache Software Foundation in 2007. • PigLatin – Pig's language – a data flow language – well suited to processing unstructured data: unlike SQL, it does not require that the data have a schema; however, it can still leverage the value of a schema
  • 51. Hive: • developed at Facebook – turns Hadoop into a data warehouse – complete with a dialect of SQL for querying. • HiveQL – a declarative language (SQL dialect). • Difference from PigLatin: – you do not specify the data flow, but instead describe the result you want; Hive figures out how to build a data flow to achieve it. – a schema is required, but not limited to one schema: data can have many schemas
  • 52. Hive (Cont'd): • Similarity with PigLatin and SQL: – HiveQL on its own is a relationally complete language, but not a Turing-complete language (one that can express any computation) – it can be extended through UDFs (User Defined Functions) written in Java, just like Pig, to become Turing complete
  • 53. Jaql: • developed at IBM. • a data flow language – its native data structure format is JSON (JavaScript Object Notation). • Schemas are optional. • Turing complete on its own – without the need for extension through UDFs.
  • 54. MapReduce Cons and Future: Bad for: fast response time; large amounts of shared data; fine-grained synchronization needed; CPU-intensive rather than data-intensive work; continuous input streams. Hadoop 2.0: YARN: not a product yet but will be soon
  • 55. Hadoop 2.0: YARN: Data processing applications and services: Online serving – HOYA (HBase on YARN); Real-time event processing – Storm, S4, other commercial platforms; Tez – generic framework to run a complex DAG; MPI: OpenMPI, MPICH2; Master-Worker; Machine Learning: Spark; Graph processing: Giraph. Enabled by allowing the use of paradigm-specific application masters [http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
  • 56. Big Data Supporters: Amazon AWS; Facebook; Twitter; Craigslist
  • 57. Amazon AWS: amazon.com: consumer and seller business. aws.amazon.com: IT infrastructure business – focus on your business, not IT management. Pay as you go. Services with many APIs – S3: Simple Storage Service – EC2: Elastic Compute Cloud • provides many virtual Linux servers • can run on multiple nodes – Hadoop and HBase – MongoDB
  • 58. Amazon AWS (Cont’d): Customers on aws.amazon.com: Samsung – Smart TV hub sites: TV applications are on AWS; Netflix – ~25% of US internet traffic – ~100% on AWS; NASA JPL – analyzes more than 200,000 images; NASDAQ – using AWS S3. HiPIC received research and teaching grants from AWS
  • 59. Facebook [7]: Using Apache HBase: for Titan and Puma – message services – ETL. HBase for FB – provides excellent write performance and good reads – nice features • scalable • fault tolerant • MapReduce
  • 60. Titan: Facebook: Message services in FB: hundreds of millions of active users; 15+ billion messages a month; 50K instant messages a second. Challenges: high write throughput – every message, instant message, SMS, email; massive clusters – must be easily scalable. Solution: clustered HBase
  • 61. Puma: Facebook: ETL: Extract, Transform, Load – integrating data from many data sources into the data warehouse. Data analytics – domain owners’ web analytics for ads and apps • clicks, likes, shares, comments, etc. ETL before Puma: 8 – 24 hours – procedure: Scribe, HDFS, Hive, MySQL. ETL after Puma: Puma – real-time MapReduce framework; 2 – 30 secs – procedure: Scribe, HDFS, Puma, HBase
  • 62. Twitter [8]: Three challenges: Collecting data – Scribe, as at FB. Large-scale storage and analysis – Cassandra: ColumnFamily key-value store – Hadoop. Rapid learning over Big Data – Pig • 5% of the Java code • 5% of the dev time • within 20% of the running time
  • 63. Craigslist in MongoDB [9]: Craigslist: ~700 cities, worldwide; ~1 billion hits/day; ~1.5 million posts/day. Servers – ~500 servers – ~100 MySQL servers. Migrate to MongoDB: scalable, fast, proven, friendly
  • 64. Hadoop Streaming: Hadoop MapReduce for non-Java code: Python, Ruby. Requirements: a running Hadoop; the Hadoop Streaming API – hadoop-streaming.jar; Mapper and Reducer code – a simple conversion from sequential code. Data flow: STDIN > mapper > reducer > STDOUT
  • 65. Hadoop Streaming: MapReduce Python execution: http://wiki.apache.org/hadoop/HadoopStreaming. Syntax: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]. Options: -input <path> DFS input file(s) for the Map step; -output <path> DFS output directory for the Reduce step; -mapper <cmd|JavaClassName> the streaming command to run; -reducer <cmd|JavaClassName> the streaming command to run; -file <file> file/dir to be shipped in the Job jar file. Example: $ bin/hadoop jar contrib/streaming/hadoop-streaming.jar -file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py -file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py -input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
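The slides do not show the contents of mapper.py and reducer.py. As a sketch of what such a pair could look like, the block below assumes a word count over the Shakespeare input, which is an assumption and not something the slides state. Both stages are written as functions over lines of text so they can be tried locally; `run_locally` is a hypothetical helper that imitates the sort Hadoop Streaming performs between the two stages.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' line per word of input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key.

    Relies on the input being sorted by key, as Streaming delivers it.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Imitate 'cat input | mapper | sort | reducer' in one process."""
    return list(reducer(sorted(mapper(lines))))


# In the real scripts, each file would end with a loop such as:
#   import sys
#   for out in mapper(sys.stdin):   # reducer(sys.stdin) in reducer.py
#       print(out)

print(run_locally(["to be or not to be"]))
```

Because each stage only reads lines and writes lines, the same logic can be tested with a shell pipeline before submitting it to a cluster.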
  • 66. Conclusion: Era of Big Data. Need to store and compute Big Data. Many solutions, but Hadoop stands out: Storage: NoSQL DB; Computation: Hadoop MapReduce. Need to analyze Big Data in mobile computing and SNS for ads, user behavior, patterns …
  • 67. Questions?