Big Data Fundamentals in the Emerging New Data World


I talk about the fundamentals of Big Data, including Hadoop, data-intensive computing, and the NoSQL databases that have drawn attention for computing and storing Big Data, which typically exceeds a petabyte in size. I also introduce case studies that use Hadoop and NoSQL databases.

Published in: Technology, Education
  • PigLatin can leverage a schema if you choose to supply one. Like SQL, PigLatin is relationally complete, which means it is at least as powerful as relational algebra. Turing completeness additionally requires looping constructs, an infinite memory model, and conditional constructs. PigLatin is not Turing complete on its own, but becomes Turing complete when extended with User-Defined Functions written in Java.

    1. 1. HiPIC Big Data Fundamentals in the Emerging New Data World. PIT (Product Innovation Team), Samsung Electronics America, San Jose, CA, Aug 17th 2012. Jongwook Woo (PhD), High-Performance Internet Computing Center (HiPIC), Educational Partner with Cloudera and Grants Awardee of Amazon AWS, Computer Information Systems Department, California State University, Los Angeles. Jongwook Woo CSULA
    2. 2. HiPIC Contents Fundamentals of Big Data NoSQL DB: HBase, MongoDB Data-Intensive Computing: Hadoop Big Data Supporters and Use Cases CSULA Jongwook Woo
    3. 3. HiPIC Experience in Big Data Several publications regarding Hadoop and NoSQL  “Apriori-Map/Reduce Algorithm”, Jongwook Woo, PDPTA 2012, Las Vegas (July 16-19, 2012)  “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, EDB 2011, Incheon (Aug. 25-27, 2011)  “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, PDPTA 2011, Las Vegas (July 18-21, 2011)  “Introduction to Cloud Computing”, Jongwook Woo, the 10th KOCSEA Technical Symposium, UNLV, Dec 18-19, 2009 Talks at Korean universities and companies  Yonsei, Sookmyung, KAIST, Korean Polytech Univ – Winter 2011  VanillaBreeze – Winter 2011 CSULA Jongwook Woo
    4. 4. HiPIC Experience in Big Data (Cont’d) Grants  Received Amazon AWS in Education Research Grant (July 2012 - July 2014)  Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011) Partnership  Academic Education Partnership with Cloudera since June 2012 Certificate  Certificate of Achievement in the Big Data University training course “Hadoop Fundamentals I”, July 8, 2012 Cloud Computing Blog CSULA Jongwook Woo
    5. 5. HiPIC What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing (word-cloud slide: Cloudera, AWS, Hortonworks, NoSQL DB) CSULA Jongwook Woo
    6. 6. HiPIC Big DataToo much data Tera-Byte (1012), Peta-byte (1015) – Because of web – Sensor Data, Bioinformatics, Social Computing, smart phone, online game…Cannot handle with the legacy approach Too big Un-/Semi-structured data CSULA Jongwook Woo
    7. 7. HiPIC Two Issues in Big Data How to store Big Data  NoSQL DB How to compute Big Data  Parallel computing with multiple cheap computers – No need for supercomputers CSULA Jongwook Woo
    8. 8. HiPIC Contents Fundamentals of Big Data NoSQL DB: HBase, MongoDB Data-Intensive Computing: Hadoop Big Data Supporters and Use Cases CSULA Jongwook Woo
    9. 9. HiPIC New Data Trend Sparsity  Schema-free data with sparse attributes – Document term vector – User-item matrix – Semantic or social relations  No relational properties – no complex join queries • Log data CSULA Jongwook Woo
    10. 10. HiPIC New Data Trend (Cont’d) Immutable  No need to update or delete data – Only insert, with versions • Tracking history • Lock-free (key-based atomicity) CSULA Jongwook Woo
    11. 11. HiPIC Big Data for RDBMS Issues in RDBMS  Hard to scale – Relations get broken • Partitioning for scalability • Replication for availability  Speed – The seek time of physical storage • Slower than N/W speed • 1 TB disk at 10 MB/s transfer rate – 100K sec => 27.8 hrs – With multiple data sources at different places • 100 10 GB disks, each at 10 MB/s transfer rate – 1K sec => 16.7 min CSULA Jongwook Woo
    12. 12. HiPIC Big Data for RDBMS (Cont’d) Issues in RDBMS (Cont’d)  Data integration – Not good for un-/semi-structured data • Much unstructured data – web or log data etc.  RDB is not good at parallelization – Cannot split 1,000 tasks across 1,000 cheap PCs efficiently CSULA Jongwook Woo
    13. 13. HiPIC RDBMS IssuesSolution Big Data ⇒Data Cleansing by Hadoop ⇒ Data Computation (MapReduce, Pig) ⇒ Data Repositories (NoSQL DB: HBase, Cassandra, MongoDB) ⇒Business Intelligence (Data Mining, OLAP, Data Visualization, Reporting): Hive, Mahout CSULA Jongwook Woo
    14. 14. HiPIC NoSQL DBs not primarily built on tables,  generally do not use SQL for data manipulation  non-relational, distributed data stores – often do not provide ACID (atomicity, consistency, isolation, durability) • which are the key attributes of classic RDB Fast Index on large amount of data  Lookup by keys (key/value) NoSQL normally supports MapReduce  Parallel computation CSULA Jongwook Woo
    15. 15. HiPIC Use Cases for NoSQL DB [1] RDBMS replacement for high-traffic web applications Semi-structured content management Real-time analytics & high-speed logging Web infrastructure: Web 2.0, Media, SaaS, Gaming, Finance, Telecom, Healthcare, Government Three NoSQL DB approaches  Key/Value, Column, Document CSULA Jongwook Woo
    16. 16. HiPIC Data Store of NoSQL DB Key/Value store (Key, Value) Functions – Index, versioning, sorting, locking, transaction, replication Apache Cassandra, Memcached CSULA Jongwook Woo
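The key/value model above can be sketched in a few lines of Python. This is an illustrative in-memory store with insert-only versioning in the spirit of the slide's feature list; the class and method names are my own, not a Cassandra or Memcached API:

```python
# Minimal in-memory key/value store sketch with versioning.
# Hypothetical names for illustration only, not a real NoSQL client API.

class KVStore:
    def __init__(self):
        self.data = {}  # key -> list of (version, value)

    def put(self, key, value):
        versions = self.data.setdefault(key, [])
        versions.append((len(versions) + 1, value))  # insert-only: history is kept

    def get(self, key, version=None):
        versions = self.data[key]
        if version is None:
            return versions[-1][1]      # latest version
        return versions[version - 1][1] # historical version

store = KVStore()
store.put("user:42", {"name": "Joe"})
store.put("user:42", {"name": "Joe", "city": "LA"})
print(store.get("user:42"))             # latest value
print(store.get("user:42", version=1))  # first version, for tracking history
```

Because writes only append new versions, no in-place update or lock is needed, which matches the "immutable, insert with versions" trend described earlier.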
    17. 17. HiPIC Data Store of NoSQL DB (Cont’d) Column-Oriented Stores (Extensible Record Stores) stores data tables as sections of columns of data – rather than as rows of data, like most RDBMS • Sparse fields in RDBMS – well-suited for OLAP-like workloads (e.g., data warehouses) Extensible record horizontally and vertically partitioned across nodes – Rows and Columns are distributed over multiple nodes BigTable, HBase, Cassandra, Hypertable CSULA Jongwook Woo
    18. 18. HiPIC Data Store of NoSQL DB (Cont’d) Example table: (StudentId, Lastname, Firstname, email) = (1, Smith, Joe, smith@hi.com), (2, Jones, Mary, mary@hi.com), (3, Johnson, Cathy, –)  Row Oriented – 1, Smith, Joe, smith@hi.com; – 2, Jones, Mary, mary@hi.com; – 3, Johnson, Cathy, –;  Column Oriented – 1, 2, 3; – Smith, Jones, Johnson; – Joe, Mary, Cathy; – smith@hi.com, mary@hi.com, –; CSULA Jongwook Woo
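The row-oriented vs column-oriented layouts above can be demonstrated with a short Python sketch over the same student table (Cathy's missing email is modeled as None to show a sparse field):

```python
# Contrast row-oriented and column-oriented serialization of the student table.

rows = [
    (1, "Smith", "Joe", "smith@hi.com"),
    (2, "Jones", "Mary", "mary@hi.com"),
    (3, "Johnson", "Cathy", None),  # sparse field: no email
]

# Row-oriented: each record is stored contiguously (typical RDBMS layout)
row_layout = [list(r) for r in rows]

# Column-oriented: each attribute is stored contiguously (HBase/Cassandra style)
column_layout = [list(col) for col in zip(*rows)]

print(column_layout[1])  # all last names together: ['Smith', 'Jones', 'Johnson']
```

Scanning one attribute over many records (an OLAP-style workload) touches only one contiguous list in the column layout, which is why the slide calls column stores well-suited for data warehouses.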
    19. 19. HiPIC HBase Schema Example (Student/Course)  RDBMS  Students: (id, name, sex, age)  Courses: (id, title, desc, teacher_id)  S_C: (s_id, c_id, type)  HBase  Students table: row key <student_id>, Column Families Info:name, Info:sex, Info:age, Course:<course_id>=type  Courses table: row key <course_id>, Column Families Info:title, Info:desc, Info:teacher_id, student:<student_id>=type CSULA Jongwook Woo
    20. 20. HiPIC Data Store of NoSQL DB (Cont’d) Document Store Collections and Documents – vs Tables and Records of RDB Used in Search Engine/Repository Multiple index to store indexed document – no fixed fields Not simple key-value lookup – Use API Functions – No locking, Replication, Transaction MongoDB, CouchDB, ThruDB, SimpleDB CSULA Jongwook Woo
    21. 21. HiPIC The Great Divide [1] (comparison chart: MongoDB vs HBase) MongoDB sweet spot: Easy, Flexible, Scalable CSULA Jongwook Woo
    22. 22. HiPIC Understanding the Document Model [1] { _id: “A4304”, author: “nosh”, date: 22/6/2010, title: “Intro to MongoDB”, text: “MongoDB is an open source..”, tags: [“webinar”, “opensource”], comments: [{author: “mike”, date: 11/18/2010, txt: “Did you see the…”, votes: 7}, ….] } Documents -> Collections -> Databases CSULA Jongwook Woo
    23. 23. HiPIC Document Model Makes Queries Simple [1] Operators: $gt, $lt, $gte, $lte, $ne, $all, $in, $nin, count, limit, skip, group Example: db.posts.find({author: “nosh”, tags: “webinar”}) CSULA Jongwook Woo
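As a rough illustration, the slide's query can be simulated over plain Python dicts. The find() below mimics the equality and array-membership matching shown, without a live MongoDB; the second post and its fields are made up for the example:

```python
# Simulated document collection and a MongoDB-style find(), for illustration only.

posts = [
    {"_id": "A4304", "author": "nosh", "title": "Intro to MongoDB",
     "tags": ["webinar", "opensource"]},
    {"_id": "B1001", "author": "mike", "title": "Scaling tips",  # made-up document
     "tags": ["performance"]},
]

def find(collection, query):
    """Match documents where each queried field equals the query value;
    for array fields, a scalar query value matches any element."""
    def matches(doc):
        for field, want in query.items():
            have = doc.get(field)
            if isinstance(have, list):
                if want not in have:
                    return False
            elif have != want:
                return False
        return True
    return [d for d in collection if matches(d)]

results = find(posts, {"author": "nosh", "tags": "webinar"})
print([d["_id"] for d in results])  # ['A4304']
```

A real driver (e.g. pymongo) adds the richer operators listed on the slide ($gt, $in, limit, skip, …), but the core matching idea is the same.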
    24. 24. HiPIC Selected Users [1] CSULA Jongwook Woo
    25. 25. HiPIC Contents Fundamentals of Big Data NoSQL DB: HBase, MongoDB Data-Intensive Computing: Hadoop Big Data Supporters and Use Cases CSULA Jongwook Woo
    26. 26. HiPIC Data nowadays• Data Issues o data grows to 10TB, and then 100TB. o Unstructured data coming from sources  like Facebook, Twitter, RFID readers, sensors, and so on.  Need to derive information from both the relational data and the unstructured data • as soon as possible.• Solution to efficiently compute Big Data o Hadoop Map/Reduce CSULA Jongwook Woo
    27. 27. HiPIC Solutions in Big Data Computation Map/Reduce by Google  (Key, Value) parallel computing Apache Hadoop  Big Data ⇒Data Computation (MapReduce, Pig) Integrating MapReduce and RDB  Oracle + Hadoop  Sybase IQ  Vertica + Hadoop  Hadoop DB  Greenplum  Aster Data Integrating MapReduce and NoSQL DB  MongoDB MapReduce  HBase CSULA Jongwook Woo
    28. 28. HiPIC Apache Hadoop Motivated by Google Map/Reduce and GFS  open source project of the Apache Foundation  framework written in Java – originally developed by Doug Cutting • who named it after his son's toy elephant Two core components  Storage: HDFS – high-bandwidth clustered storage  Processing: Map/Reduce – fault-tolerant distributed processing Hadoop scales linearly with  data size  analysis complexity CSULA Jongwook Woo
    29. 29. HiPIC Hadoop issues Map/Reduce is not a DB  Algorithms in restricted parallel computing HDFS and HBase  Cannot compete with the functions in RDBMS But useful for  Semi-structured data models and high-level dataflow query languages on top of MapReduce – Pig, Hive, Jaql, Cascading, Cloudbase  Huge (peta- or tera-byte) but non-complicated data – Web crawling – Log analysis • Log files for web companies – New York Times case CSULA Jongwook Woo
    30. 30. HiPIC MapReduce Pros & Cons Summary Good when Huge data for input, intermediate, output A few synchronization required Read once; batch oriented datasets (ETL) Bad for Fast response time Large amount of shared data Fine-grained synch needed CPU-intensive not data-intensive Continuous input stream CSULA Jongwook Woo
    31. 31. HiPIC MapReduce in DetailFunctions borrowed from functional programming languages (eg. Lisp)Provides Restricted parallel programming model on Hadoop User implements Map() and Reduce() Libraries (Hadoop) take care of EVERYTHING else – Parallelization – Fault Tolerance – Data Distribution – Load Balancing CSULA Jongwook Woo
    32. 32. HiPIC MapConvert input data to (key, value) pairsmap() functions run in parallel,  creating different intermediate (key, value) values from different input data sets CSULA Jongwook Woo
    33. 33. HiPIC Reduce reduce() combines those intermediate values into one or more final values for that same key reduce() functions also run in parallel, each working on a different output key Bottleneck: reduce phase can’t start until map phase is completely finished. CSULA Jongwook Woo
    34. 34. HiPIC Example: Sort URLs in the largest-hit order Compute the largest-hit URLs  Stored in log files Map()  Input: <logFilename, file text>  Output: parses the file and emits <url, hit count> pairs – e.g. <url, 1> Reduce()  Input: <url, list of hit counts> from multiple map nodes  Output: sums all values for the same key and emits <url, TotalCount> – e.g. <url, (3, 5, 2, 7)> => <url, 17> CSULA Jongwook Woo
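The job described above can be sketched as plain local Python functions. This simulates the map, shuffle, and reduce phases in-process; it is not the Hadoop API, and the log lines and URLs are made up:

```python
from collections import defaultdict

# Local simulation of the URL hit-count MapReduce job (illustrative data).

log_lines = [
    "GET /index.html", "GET /about.html", "GET /index.html",
    "GET /index.html", "GET /about.html",
]

def map_fn(line):
    method, url = line.split()
    yield (url, 1)  # emit <url, 1> per hit

def shuffle(pairs):
    grouped = defaultdict(list)       # framework step: group values by key
    for key, value in pairs:
        grouped[key].append(value)
    return grouped                    # <url, list of hit counts>

def reduce_fn(key, values):
    return (key, sum(values))         # <url, TotalCount>

intermediate = [p for line in log_lines for p in map_fn(line)]
totals = dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
print(sorted(totals.items(), key=lambda kv: -kv[1]))
# [('/index.html', 3), ('/about.html', 2)]
```

In real Hadoop the shuffle is done by the framework between the map and reduce phases, which is exactly the bottleneck noted on the Reduce slide: no reducer can start until every mapper has finished.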
    35. 35. HiPIC Map/Reduce for URL visits (dataflow diagram) Input log data → Map1(), Map2(), …, Mapm() emit (url, 1) pairs → Data aggregation/combine groups the values per url, e.g. (url, <3, 5, 2, 7>) → Reduce1(), Reduce2(), …, Reducel() emit totals, e.g. (url, 17) CSULA Jongwook Woo
    36. 36. HiPIC Legacy Example In late 2007, the New York Times wanted to make its entire archive of articles available over the web: 11 million in all, dating back to 1851 – a four-terabyte pile of images in TIFF format. It needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files – not a particularly complicated but a large computing chore, • requiring a whole lot of computer processing time. CSULA Jongwook Woo
    37. 37. HiPIC Legacy Example (Cont’d) Derek Gottfrid, a software programmer at the Times – playing around with Amazon Web Services' Elastic Compute Cloud (EC2) • uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3) • In less than 24 hours: 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site.  The total cost for the computing job? $240 – 10 cents per computer-hour × 100 computers × 24 hours CSULA Jongwook Woo
    38. 38. HiPIC Contents Fundamentals of Big Data NoSQL DB: HBase, MongoDB Data-Intensive Computing: Hadoop Big Data Supporters and Use Cases CSULA Jongwook Woo
    39. 39. HiPIC Supporters of Big Data Apache Hadoop Supporters  Cloudera – Like Linux and Redhat – HiPIC is an Academic Partner  Hortonworks – Pig  Facebook – Hive  IBM – Jaql NoSQL DB supporters  MongoDB – HiPIC tries to collaborate  HBase, CouchDB, Apache Cassandra (originally by FB) etc CSULA Jongwook Woo
    40. 40. HiPIC Similarities in Pig, Hive, and Jaql• translate high-level languages into MapReduce jobs o the programmer can work at a higher level  than writing MapReduce jobs in Java or other lower-level languages• programs are much smaller than Java code• option to extend these languages, o often by writing user-defined functions in Java• Interoperability o programs written in these high-level languages can be embedded inside other languages as well• the same limitations as Hadoop o no support for random reads and writes o or for low-latency queries CSULA Jongwook Woo
    41. 41. HiPIC Pig• developed at Yahoo Research around 2006 o moved into the Apache Software Foundation in 2007• PigLatin, o Pig's language o a data flow language o well suited to processing unstructured data  Unlike SQL, it does not require that the data have a schema  However, it can still leverage the value of a schema CSULA Jongwook Woo
    42. 42. HiPIC Hive• developed at Facebook o turns Hadoop into a data warehouse o complete with a dialect of SQL for querying.• HiveQL o a declarative language (SQL dialect)• Difference from PigLatin, o you do not specify the data flow,  but instead describe the result you want  Hive figures out how to build a data flow to achieve it. o a schema is required,  but not limited to one schema. o data can have many schemas CSULA Jongwook Woo
    43. 43. HiPIC Hive (Cont’d)• Similarity with PigLatin and SQL o HiveQL on its own is a relationally complete language  but not a Turing complete language (one that can express any computation) o can be extended through UDFs (User Defined Functions) in Java  just like Pig, to become Turing complete CSULA Jongwook Woo
    44. 44. HiPIC Jaql• developed at IBM.• a data flow language o its native data structure format is JSON (JavaScript Object Notation).• Schemas are optional• Turing complete on its own o without the need for extension through UDFs. CSULA Jongwook Woo
    45. 45. HiPIC Use Cases Amazon AWS Facebook Twitter Craigslist HuffPost | AOL CSULA Jongwook Woo
    46. 46. HiPIC Amazon AWS  Consumer and seller business  IT infrastructure business – Focus on your business not IT management  Pay as you go – Pay for servers by the hour – Pay for storage per Giga byte per month – Pay for data transfer per Giga byte  Services with many APIs – S3: Simple Storage Service – EC2: Elastic Compute Cloud • Provide many virtual Linux servers • Can run on multiple nodes – Hadoop and HBase – MongoDB CSULA Jongwook Woo
    47. 47. HiPIC Amazon AWS (Cont’d) Customers  Samsung – Smart TV hub sites: TV applications are on AWS  Netflix – ~25% of US internet traffic – ~100% on AWS  NASA JPL – analyzes more than 200,000 images  NASDAQ – using AWS S3 HiPIC received research and teaching grants from AWS CSULA Jongwook Woo
    48. 48. HiPIC Facebook [7] Using Apache HBase For Titan and Puma HBase for FB – Provide excellent write performance and good reads – Nice features • Scalable • Fault Tolerance • MapReduce CSULA Jongwook Woo
    49. 49. HiPIC Titan: Facebook Message services in FB  Hundreds of millions of active users  15+ billion messages a month  50K instant messages a second Challenges  High write throughput – every message, instant message, SMS, email  Massive clusters – must be easily scalable Solution  Clustered HBase CSULA Jongwook Woo
    50. 50. HiPIC Puma: Facebook ETL  Extract, Transform, Load – Data Integrating from many data sources to Data Warehouse  Data analytics – Domain owners’ web analytics for Ad and apps • clicks, likes, shares, comments etc ETL before Puma  8 – 24 hours – Procedures: Scribe, HDFS, Hive, MySQL ETL after Puma  Puma – Real time MapReduce framework  2 – 30 secs – Procedures: Scribe, HDFS, Puma, HBase CSULA Jongwook Woo
    51. 51. HiPIC Twitter [8] Three Challenges Collecting Data – Scribe as FB Large Scale Storage and analysis – Cassandra: ColumnFamily key-value store – Hadoop Rapid Learning over Big Data – Pig • 5% of Java code • 5% of dev time • Within 20% of running time CSULA Jongwook Woo
    52. 52. HiPIC Craigslist in MongoDB [9] Craigslist  ~700 cities, worldwide  ~1 billion hits/day  ~1.5 million posts/day  Servers – ~500 servers – ~100 MySQL servers Migration to MongoDB  Scalable, Fast, Proven, Friendly CSULA Jongwook Woo
    53. 53. HiPIC HuffPost | AOL [10]Two Machine Learning Use Cases Comment Moderation – Evaluate All New HuffPost User Comments Every Day • Identify Abusive / Aggressive Comments • Auto Delete / Publish ~25% Comments Every Day Article Classification – Tag Articles for Advertising • E.g.: scary, salacious, … CSULA Jongwook Woo
    54. 54. HiPIC HuffPost | AOL [10] Parallelize on Hadoop Good news: – Mahout, a parallel machine learning tool, is already available. – There are Mallet, libsvm, Weka, … that support necessary algorithms. Bad news: – Mahout doesn’t support necessary algorithms yet. – Other algorithms do not run natively on Hadoop. build a flexible ML platform running on Hadoop Pig for Hadoop implementation. CSULA Jongwook Woo
    55. 55. HiPIC MapReduce Example Word Count in the previous slide Shortest Path in the graph  Graph algorithm is very suitable for M/R, especially BFS – Spreading activation type of processing  Map: – Input: a node n as a key, and (D, points-to) as its value • D is the distance to the node from the start • points-to is a list of nodes reachable from n – Output: ∀p ∈ points-to, emit (p, D+1)  Reduce: – Input: possible distances to a given p – Output: selects the minimum one • Perform multiple iterations  Iterative process for matrix, graph, network – Apache HAMA needed? • Iterative Process on Hadoop CSULA Jongwook Woo
    56. 56. HiPIC MapReduce Example (Cont’d) Social N/W analysis  Recommend new friends (friend of a friend: FOAF)  Map – In: (x, <friendsx>) – Out: if (u, x) are friends • (u, < friendsx / friendsu >) – < friendsx / friendsu >: friends of x but not friends of u – Otherwise • nil  Reduce – In: (u, < < friendsa / friendsu >, < friendsa / friendsu >, …>) • Friends list of all users a, b, … who are friends of u – Out: (u, < (X1 , N1 ), (X2 , N2 ), …>) • Xm : FOAF of u • Nm : Total number of occurrences in all FOAF lists – To sort or rank the results CSULA Jongwook Woo
    57. 57. HiPIC MapReduce Example (Cont’d) Inverted Indexing (Full Text Search)  Map (3 nodes): – Input: • Doc1: “Columbus’s egg” • Doc 2: “The chicken and egg problem” • Doc 3: “Easter Egg” – Output: • Map1: (“columbus’s”, (doc1, 1)), (“egg”, (doc1, 2)) • Map2: (“the”, (doc2, 1)), (“chicken”, (doc2, 2)), (“and”, (doc2, 3)), (“egg”, (doc2, 4)), (“problem”, (doc2, 5)) • Map3: (“easter”, (doc3, 1)), (“egg”, (doc3, 2))  Intermediate Shuffle – (“columbus’s”, (doc1, 1)), (“egg”, <(doc1, 2), (doc2, 4), (doc3, 2)>), (“the”, (doc2, 1)), (“chicken”, (doc2, 2)), (“and”, (doc2, 3)), (“problem”, (doc2, 5)), (“easter”, (doc3, 1))) CSULA Jongwook Woo
    58. 58. HiPIC MapReduce Example (Cont’d)Inverted Indexing (Full Text Search) (Cont’d) Reduce – Input: (“columbus’s”, (doc1, 1)), (“egg”, <(doc1, 2), (doc2, 4), (doc3, 2)>), (“the”, (doc2, 1)), (“chicken”, (doc2, 2)), (“and”, (doc2, 3)), (“problem”, (doc2, 5)), (“easter”, (doc3, 1))) – Output: same as above • Assuming (“egg”, <(doc1, 2), (doc1, 4), (doc3, 2)>), output is: – (“egg”, <(doc1, <2, 4>), (doc3, 2)>), CSULA Jongwook Woo
    59. 59. HiPIC Conclusion Era of Big Data Need to store and compute Big Data  Storage: NoSQL DB  Computation: Hadoop MapReduce Need to analyze Big Data in mobile computing and SNS for ads, user behavior, patterns… CSULA Jongwook Woo
    61. 61. HiPIC References 1) “Introduction to MongoDB”, Nosh Petigara, Jan 11, 2011 2) “Hadoop Fundamentals I”, Big Data University 3) “Large Scale Data Analysis with Map/Reduce”, Marin Dimitrov, Feb 2010 4) “BFS & MapReduce”, Edward J Yoon, Feb 26, 2009 5) “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, The Third International Conference on Emerging Databases (EDB 2011), Songdo Park Hotel, Incheon, Korea, Aug. 25-27, 2011 CSULA Jongwook Woo
    62. 62. HiPIC References6) “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, Jongwook Woo and Yuhang Xu, The 2011 international Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2011),Las Vegas (July 18-21, 2011)7) Building Realtime Big Data Services at Facebook with Hadoop and Hbase, Jonathan Gray, Facebook, Nov 11, 2011, Hadoop World NYC8) Analyzing Big Data at Twitter, Kevin Well, Web 2.0 Expo, NYC, Sep 20109) Lessons Learned from Migrating 2+ Billion Documents at Craigslist, Jeremy Zawodny, 201110) Machine Learning on Hadoop at Huffington Post | AOL, Thu Kyaw and Sang Chul Song, Hadoop DC, Oct 4, 2011 CSULA Jongwook Woo
    63. 63. HiPIC References 11) “MapReduce Debates and Schema-Free”, Woohyun Kim, March 3, 2010 12) “Large Scale Data Analysis with Map/Reduce”, Marin Dimitrov, Feb 2010 13) “HBase Schema Design Case Studies”, Qingyan Liu, July 13, 2009 CSULA Jongwook Woo