Big Data NYU



Big data talk given at NYU Stern

Published in: Technology


  1. Big Data in the “Real World”
     Edward Capriolo
  2. What is “big data”?
     ● Big data is a collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.
     ● The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
  3. Big Data Challenges
     ● The challenges include:
       – capture
       – curation
       – storage
       – search
       – sharing
       – transfer
       – analysis
       – visualization
       – large
       – complex
  4. What is “big data” exactly?
     ● What is considered "big data" varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain.
     ● As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data.
  5. Big Data Qualifiers
     ● varies
     ● capabilities
     ● traditionally
     ● feasibly
     ● reasonably
     ● [something]bytes of data
  6. My first “big data” challenge
     ● Real-time news delivery platform
     ● Ingest news as text and provide full-text search
     ● Qualifiers
       – Reasonable: real-time search was < 1 second
       – Capabilities: small company, < 100 servers
     ● Big Data challenges
       – Storage: roughly 300 GB for 60 days of data
       – Search: searches of thousands of terms
  7. Traditionally
     ● Data was placed in MySQL
     ● MySQL full-text search
     ● Easy to insert
     ● Easy to search
     ● Worked great!
       – Until it got real-world load
  8. Feasibly in hardware (circa 2008)
     ● 300 GB of data and 16 GB of RAM
     ● ...MySQL stores an in-memory binary tree of the keys. Using this tree, MySQL can calculate the count of matching rows with reasonable speed. But speed declines logarithmically as the number of terms increases.
     ● The platters revolve at 15,000 RPM or so, which works out to 250 revolutions per second. Average latency is listed as 2.0 ms.
     ● As the speed of an HDD increases, the power it takes to run it increases disproportionately.
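The slide's drive numbers check out with a little arithmetic; a quick sketch (the 15,000 RPM spec is from the slide, the derivation is mine):

```java
public class DiskLatency {
    // Average rotational latency: on average the head waits half a
    // revolution for the right sector, so take half the revolution time.
    static double avgRotationalLatencyMs(double rpm) {
        double revsPerSecond = rpm / 60.0;               // 15,000 RPM -> 250 rev/s
        double msPerRevolution = 1000.0 / revsPerSecond; // 4 ms per full turn
        return msPerRevolution / 2.0;                    // wait half a turn on average
    }

    public static void main(String[] args) {
        double latencyMs = avgRotationalLatencyMs(15000);
        System.out.println("avg rotational latency: " + latencyMs + " ms"); // 2.0 ms
        // At ~2 ms per random access, one spindle sustains at most a few
        // hundred random reads per second -- nowhere near enough to walk a
        // 300 GB index that cannot fit in 16 GB of RAM.
        System.out.println("random reads/sec upper bound: " + (int) (1000.0 / latencyMs));
    }
}
```

That upper bound of a few hundred seeks per second, not CPU speed, is what made the single-box MySQL approach infeasible.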
  9. “Big Data” is about giving up things
     ● In theoretical computer science, the CAP theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
       – Consistency (all nodes see the same data at the same time)
       – Availability (a guarantee that every request receives a response about whether it was successful or failed)
       – Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
  10. Multi-Master solution
      ● Write the data to N MySQL servers and round-robin reads between them
        – Good: more machines to serve reads
        – Bad: requires Nx hardware
        – Hard: keeping machines loaded with the same data, especially auto-generated ids
        – Hard: what about when the data does not even fit on a single machine?
  11. Sharding
      ● Rather than replicate all data to all machines
      ● Replicate data to a subset of machines
        – Good: localized data
        – Good: better caching
        – Hard: joins across shards
        – Hard: management
        – Hard: failure
      ● Parallel RDBMS = $$$
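Sharding at its simplest is just a deterministic mapping from a key to a server; a minimal sketch (the modulo-on-hash scheme and the shard names are illustrative, not what the talk's systems ran):

```java
import java.util.List;

public class ShardRouter {
    private final List<String> shards;

    public ShardRouter(List<String> shards) {
        this.shards = shards;
    }

    // Route a key to one shard. Math.floorMod keeps the index
    // non-negative even when hashCode() is negative.
    public String shardFor(String key) {
        return shards.get(Math.floorMod(key.hashCode(), shards.size()));
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of("db1", "db2", "db3"));
        // The same key always lands on the same shard...
        System.out.println(router.shardFor("user:42").equals(router.shardFor("user:42"))); // true
        // ...which is exactly why joins across shards, management, and
        // failure handling are the hard parts the slide calls out:
        // adding or losing a shard changes where keys land.
    }
}
```

Production systems usually prefer consistent hashing over plain modulo so that resizing the cluster moves only a fraction of the keys.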
  12. Life lesson: “applications that are traditionally used to...”
      ● How did we solve our problem?
        – We switched to Lucene
      ● A tool designed for full-text search
      ● Eventually sharded Lucene
      ● When you hold a hammer:
        – Not everything is a nail
      ● Understand what you really need
      ● Understand reasonable and feasible
  13. Big Data Challenge #2
      ● Large, high-volume web site
      ● Process the logs and produce reports
      ● Big Data challenges
        – Storage: store GBs of data a day, for years
        – Analysis, visualization: support the reports of the existing system
      ● Qualifiers
        – Reasonable: daily reports in less than one day
        – Honestly needs to be faster (reruns, etc.)
  14. Enter Hadoop
      ● Hadoop (0.17.x) was fairly new at the time
      ● Use cases of map/reduce were emerging
        – Hive had just been open sourced by Facebook
      ● Many database vendors were calling map/reduce “a step backwards”
        – They had solved these problems “in the '80s”
  15. Hadoop file system: HDFS
      ● Distributed, redundant storage
        – We were NoSPOF (no single point of failure) across the board
      ● Commodity hardware vs. buying a big SAN/NAS device
      ● We already had processes that scp'd data to servers; they were easily adapted to placing files into HDFS
      ● HDFS made huge storage easy
  16. Map Reduce
      ● As a proof of concept I wrote a group/count application that would group and count on a column in our logs
      ● Was able to show linear speed-up with increased nodes
  17. Winning (why Hadoop kicked arse)
      ● Data capture, curation
        – Bulk loading data into an RDBMS (indexes, overhead)
        – Bulk loading into Hadoop is a network copy
      ● Data analysis
        – The RDBMS would not parallelize queries (even across partitions)
        – Some queries could cause locks and performance degradation
  18. Enter Hive
      ● Capture: NO
      ● Curation: YES
      ● Storage: YES
      ● Search: YES
      ● Sharing: YES
      ● Transfer: NO
      ● Analysis: YES
      ● Visualization: NO
  19. Logging from Apache to Hive
  20. Sample program: group and count
      Source data looks like:
        jan 10 2009:.........:200:/index.htm
        jan 10 2009:.........:200:/index.htm
        jan 10 2009:.........:200:/igloo.htm
        jan 10 2009:.........:200:/ed.htm
  21. In case you're the math type
      (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
      Map(k1, v1) -> list(k2, v2)
      Reduce(k2, list(v2)) -> list(v3)
  22. A mapper
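The original slide showed the mapper's source, which is lost in this transcript. A minimal sketch of the same idea in plain Java so it stands alone (the real version would extend Hadoop's `Mapper` class and emit through a `Context`; the parsing of the sample log format is my assumption):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

public class UrlMapper {
    // For one log line like "jan 10 2009:.........:200:/index.htm",
    // emit the URL (the last colon-separated field) with a count of 1.
    public static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> out = new ArrayList<>();
        String[] fields = line.split(":");
        String url = fields[fields.length - 1];
        out.add(new SimpleEntry<>(url, 1));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("jan 10 2009:.........:200:/index.htm")); // [/index.htm=1]
    }
}
```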
  23. A reducer
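Likewise the reducer slide's code is lost; a matching plain-Java sketch of the reduce side of the group/count job (in real Hadoop this would extend `Reducer`, and the framework's shuffle phase delivers the grouped values):

```java
import java.util.List;

public class UrlReducer {
    // For one key (URL) and all of its mapped values, emit the total count.
    public static int reduce(String url, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        // The shuffle has already grouped mapper output by key, so
        // "/index.htm" arrives once with all of its 1s.
        System.out.println("/index.htm -> " + reduce("/index.htm", List.of(1, 1))); // /index.htm -> 2
    }
}
```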
  24. Hive style
      hive> CREATE TABLE web_data(
              sdate STRING, stime STRING, envvar STRING,
              remotelogname STRING, servername STRING, localip STRING,
              literaldash STRING, method STRING, url STRING,
              querystring STRING, status STRING, literalzero STRING,
              bytessent INT, header STRING, timetoserver INT,
              useragent STRING, cookie STRING, referer STRING);

      SELECT url, count(1) FROM web_data GROUP BY url;
  25. Life lessons, volume 2
      ● Feasible and reasonable were completely different than in case #1
      ● Query time: from seconds to hours
      ● Size: from GB to TB
      ● Feasible: from 4 nodes to 15
  26. Big Data Challenge #3 (work at m6d)
      ● Large, high-volume ad-serving site
      ● Process the logs and produce reports
      ● Support data science and biz-dev users
      ● Big Data challenges
        – Storage: store and process terabytes of data
          ● Complex data types, encoded data
        – Analysis, visualization: support the reports of the existing system
      ● Qualifiers
        – Reasonable: ad hoc, daily, hourly, weekly, monthly reports
  27. Data, data everywhere
      ● We have to use cookies in many places
      ● Cookies have limited size
      ● Cookies have complex, encoded values
  28. Some encoding tricks we might do
      LastSeen: long (64 bits)
      Segment: int (32 bits)
      Literal ','
      Segment: int (32 bits)
      Zipcode (32 bits)
      ● Choose a relevant epoch and use a byte
      ● Use a byte for # of segments
      ● Use a 4-byte radix-encoded number
      ● ... and so on
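The tricks above can be sketched in a few lines; a hypothetical packing of a last-seen value and a segment id into one long (the field widths and the days-since-epoch scheme are illustrative, not m6d's actual cookie format):

```java
public class CookiePacker {
    // Pack a days-since-a-chosen-epoch value (16 bits) and a segment id
    // (32 bits) into one long. Shrinking LastSeen from a 64-bit millisecond
    // timestamp to a small day count is the "choose a relevant epoch" trick.
    public static long pack(int daysSinceEpoch, int segment) {
        return ((long) (daysSinceEpoch & 0xFFFF) << 32) | (segment & 0xFFFFFFFFL);
    }

    public static int daysSinceEpoch(long packed) {
        return (int) ((packed >>> 32) & 0xFFFF);
    }

    public static int segment(long packed) {
        return (int) (packed & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long packed = pack(5000, 215286);
        System.out.println(daysSinceEpoch(packed) + " / " + segment(packed)); // 5000 / 215286
    }
}
```

The payoff is that a handful of such fields fit in a cookie's size budget, at the cost of needing code (or, as the next slides show, Hive UDFs) to get the values back out.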
  29. Getting at embedded data
      ● Write N UDFs for each object, like:
        – getLastSeenForCookie(String)
        – getZipcodeForCookie(String)
        – ...
      ● But this would have made a huge toolkit
      ● Traditionally you do not want to break first normal form
  30. Struct solution
      ● Hive has a struct, like a C struct
      ● A struct is a list of name/value pairs
      ● Structs can contain other structs
      ● This gives us the serious ability to do object mapping
      ● UDFs can return struct types
  31. Using a UDF
      ● add jar myjar.jar;
      ● CREATE TEMPORARY FUNCTION parseCookie AS 'com.md6.ParseCookieIntoStruct';
      ● SELECT parseCookie(encodedColumn).lastSeen FROM mydata;
  32. LATERAL VIEW + EXPLODE
      SELECT client_id, entry.spendcreativeid
      FROM datatable
      LATERAL VIEW explode(AdHistoryAsStruct(ad_history).adEntrylist) entryList AS entry
      WHERE hit_date=20110321 AND mid=001406;

      3214498023360851706   215286
      3214498023360851706   195785
      3214498023360851706   128640
  33. All that data might boil down to...
  34. Life lessons, volume #3
      ● Big data is not only batch or real-time
      ● Big data is feedback loops
        – Machine learning
        – Ad hoc performance checks
      ● Generated SQL tables periodically synced to a web server
      ● Data shared between sections of an organization to make business decisions