Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta
Alexey Gayduk, Roman Nikolaenko

Want to hear about a project that uses a technology stack of Hadoop for distributed data processing and storage, Katta for distributed storage and processing of Lucene indexes, and MongoDB for storing unstructured data? We would like to share our real-world experience with this combination: the problems we ran into and how we solved them. One example is using third-party libraries in Hadoop Map/Reduce: it all seems obvious, but how do you do it cleanly and conveniently? Or how do you launch a Hadoop job from a web application rather than from the console, and monitor its execution? Then there is the problem of storing and processing unstructured data in MySql. What data did we keep there, and why did we decide to use MongoDB? And why do we use Katta after all? All of these problems and their solutions come from a real business idea, and we will tell you about all of it.

Published in: Technology, Business

  1. Notes From the Front Line: Hadoop, NoSql, RDBMS, Katta
     Alexey Gayduk, Roman Nikolaenko
  2. E-commerce
  3. E-commerce: Amazon, Macy’s, Target, Walmart
  4. Intelligence
  5. Product
  6. What are the characteristics?
  7. ● Size
     ● Material
     ● Pocket Style
     ● Weave type
     ● Hem Style
     ● Cleaning
     ● Fit
  8. Target: Wrangler Jeans - Women’s Bootcut Jeans
     ● Weave Type: Denim
     ● Pocket Style: 5 Pocket Pockets
     ● Cleaning: Machine Wash Cold
     ● Rise: Low Rise Rise
     ● Fit: 3, Mid Waist
     ● Decorative Details: Top Stitching
     ● Protective Features: Stretch
     ● Hem Style: Finished Hems
  9. Target: Wrangler Jeans - Women’s Bootcut Jeans
     ● Weave Type: Denim
     ● Pocket Style: 5 Pocket Pockets
     ● Cleaning: Machine Wash Cold
     ● Rise: Low Rise Rise
     ● Fit: 3, Mid Waist
     ● Decorative Details: Top Stitching
     ● Protective Features: Stretch
     ● Hem Style: Finished Hems
     Walmart: Wrangler® Womens Bootcut Jean - Grey
  10. Target: Wrangler Jeans - Women’s Bootcut Jeans
      ● Weave Type: Denim
      ● Pocket Style: 5 Pocket Pockets
      ● Cleaning: Machine Wash Cold
      ● Rise: Low Rise Rise
      ● Fit: 3, Mid Waist
      ● Decorative Details: Top Stitching
      ● Protective Features: Stretch
      ● Hem Style: Finished Hems
      Walmart: Wrangler® Womens Bootcut Jean - Grey
      ● Weave Type: Denim
      ● Pockets: 2 hip pockets, 1 watch pocket, 2 front scoop pockets
      ● Fabric Care Instructions: Machine Wash,Tumble Dry
      ● Decorative Details: Top Stitching
      ● Fabric Content: Cotton, Spandex
  11. What problems do we solve?
      ● Data capture
      ● Processing and storage
      ● Reports
  12. Data Capture: Crawler
      ● Distributed: uses EC2 for crawling
      ● Failover: the failed node will be replaced with another
  13. Data Capture: JSON
      {
        "Name":"Wrangler® Womens Bootcut Jean - Grey",
        "Weave Type":"Denim",
        "Pockets":"2 hip pockets, 1 watch pocket, 2 front scoop pockets",
        "Fabric Care Instructions":"Machine Wash,Tumble Dry",
        "Decorative Details":"Top Stitching",
        "Fabric Content":"Cotton,Spandex"
      }
  14. Processing and Storage
      ● Distributed data storage
      ● Distributed data processing
  15. HDFS (Hadoop Distributed File System)
      http://dev-time.org/?p=893
      ● Files are stored as blocks
      ● Write once, read many times
      ● Reliability by replication
      ● One central point of access to files
  16. Data Matching
      Target: Wrangler Jeans - Women’s Bootcut Jeans
      ● Weave Type: Denim
      ● Pocket Style: 5 Pocket Pockets
      ● Cleaning: Machine Wash Cold
      ● Rise: Low Rise Rise
      ● Fit: 3, Mid Waist
      ● Decorative Details: Top Stitching
      ● Protective Features: Stretch
      ● Hem Style: Finished Hems
      Walmart: Wrangler® Womens Bootcut Jean - Grey
      ● Weave Type: Denim
      ● Pockets: 2 hip pockets, 1 watch pocket, 2 front scoop pockets
      ● Fabric Care Instructions: Machine Wash,Tumble Dry
      ● Decorative Details: Top Stitching
      ● Fabric Content: Cotton, Spandex
  17. Data Matching: How to match?
      "Wrangler Jeans - Women’s Bootcut Jeans" vs. "Wrangler® Womens Bootcut Jean - Grey"
  18. Katta (Lucene index storage)
      ● Distributed storage of Lucene index
      ● Makes serving large or high load indices easy
      ● Failover
      ● Data replication
      ● Easy to scale
      ● Plays well with Hadoop cluster
  19. It's time to... Roman Nikolaenko
  20. Crawled Data: Target, Amazon, eBay, Sears, Walmart
  21. Crawled Data; "Human" Cloud
  22. "Human" Cloud; Data Storage for Reports
  23. Data Storage for Reports
  24. "Human" Cloud
  25. Hadoop
  26. Data Flow (TaskChain): Crawled Data; Hadoop; Data Storage for Reports
  27. TaskChain: a crawled record and the same record after processing
      {
        "Name":"Wrangler® Womens Bootcut Jean - Grey",
        "Weave Type":"Denim",
        "Pockets":"2 hip pockets, 1 watch pocket, 2 front scoop pockets",
        "Fabric Care Instructions":"Machine Wash,Tumble Dry",
        "Decorative Details":"Top Stitching",
        "Fabric Content":"Cotton,Spandex"
      }
      {
        "NAME":"Womens Bootcut Jean",
        "MFGR_NAME":"WRANGLER",
        "COLOR":"GREY",
        "WEAVE_TYPE":"DENIM",
        "POKETS_TYPES":"2 HIP|1 WATCH|2 FRONT SCOOP",
        "CARE_INSTRUCTIONS":"MACHINE WASH|TUMBLE DRY",
        "DECORATIVE_DETAILS":"Top Stitching",
        "CONTENT":"COTTON, SPANDEX",
        "FINGERPRINT":"Womens Bootcut Jean!WRANGLER",
        "FINGERPRINT_HASH":"3902152632",
        "MAS_PROD_ID":"72312"
      }
  28. TaskChain steps
      0. Load data to HDFS
      1. Attribute Name Transformation
      2. Attribute Values Normalization
      3. Create Fingerprints
      4. Make Mappings
      5. Load data to Data Storage for Reports
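
Steps 2 and 3 can be illustrated with a minimal Java sketch. It reproduces the value shapes shown on slide 27 (comma-separated values upper-cased and joined with "|", a fingerprint of the form name + "!" + manufacturer). The class and method names are hypothetical, and CRC32 is only an assumed hash function: the slides do not say how FINGERPRINT_HASH is computed, so the hash printed here is not expected to match the value on the slide.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.zip.CRC32;

// Hypothetical helper, not the authors' code.
final class FingerprintSketch {

    // Step 2 (value normalization): split on commas, trim, upper-case, join with '|'.
    static String normalizeList(String raw) {
        return Arrays.stream(raw.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(s -> s.toUpperCase(Locale.ROOT))
                .collect(Collectors.joining("|"));
    }

    // Step 3 (fingerprint): product name, '!', upper-cased manufacturer name.
    static String fingerprint(String name, String manufacturer) {
        return name.trim() + "!" + manufacturer.trim().toUpperCase(Locale.ROOT);
    }

    // Assumption: CRC32 over the UTF-8 bytes; the real hash function is not given.
    static long fingerprintHash(String fingerprint) {
        CRC32 crc = new CRC32();
        crc.update(fingerprint.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        String fp = fingerprint("Womens Bootcut Jean", "Wrangler");
        System.out.println(normalizeList("Machine Wash,Tumble Dry"));
        System.out.println(fp);
        System.out.println(fingerprintHash(fp));
    }
}
```

Two listings that normalize to the same fingerprint become candidates for the same master product, which is the purpose of the "Make Mappings" step.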
  29. TaskChain
  30. TaskChain: Control, Coordination, Monitoring, Comfort
  31. Project Manager
      PROJECT: six Task entries, each with its Task Type
  32. PROJECT
      TASK: http://example0.com/dataloader/service/
      TASK: http://example1.com/transformation/service/
      TASK: http://example2.com/normalization/service/
      TASK: http://example1.com/fingerprint/service/
      TASK: http://example3.com/lookup/service/
      TASK: http://example4.com/reportingLoader/service/
  33. REST API (JSON as DTO)
      ● GET PARAMETERS of TASK
      ● START TASK
      ● GET STATUS of TASK
      ● STOP TASK
  34. REST API (JSON as DTO)
      ● GET PARAMETERS of TASK
      ● START TASK
      ● GET STATUS of TASK
      ● STOP TASK
      TASK_TYPE: http://example0.com/dataloader/service/
  35. REST API
      Task Type: http://example0.com/dataloader/service/
      Task, Task, Task, Task
  36. Project Manager: Task 1, Task 2; Task Type; Service URL; Web Service; Hadoop; Local processing
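
The coordination idea behind the TaskChain slides can be sketched in plain Java as follows. This is an illustration under assumptions, not the authors' Project Manager: the `TaskClient` interface is a hypothetical stand-in for an HTTP client that calls the four REST operations on a task-type service URL, and the status strings "RUNNING", "DONE" and "FAILED" are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of running a chain of tasks one after another.
final class TaskChainSketch {

    // Stand-in for the REST API of one task-type service (JSON as DTO).
    interface TaskClient {
        String getParameters(String serviceUrl);
        void start(String serviceUrl);
        String getStatus(String serviceUrl); // assumed: "RUNNING", "DONE" or "FAILED"
        void stop(String serviceUrl);
    }

    // Starts each task in order and polls its status up to maxPolls times.
    // A task that does not reach "DONE" is stopped and the chain ends there.
    // Returns the URLs of the tasks that completed.
    static List<String> run(List<String> serviceUrls, TaskClient client, int maxPolls) {
        List<String> completed = new ArrayList<>();
        for (String url : serviceUrls) {
            client.start(url);
            String status = client.getStatus(url);
            // A real coordinator would wait between polls; omitted to keep the sketch short.
            for (int i = 1; i < maxPolls && "RUNNING".equals(status); i++) {
                status = client.getStatus(url);
            }
            if (!"DONE".equals(status)) {
                client.stop(url);
                break;
            }
            completed.add(url);
        }
        return completed;
    }

    public static void main(String[] args) {
        // In-memory fake: every task reports "DONE" immediately.
        TaskClient fake = new TaskClient() {
            public String getParameters(String u) { return "{}"; }
            public void start(String u) { System.out.println("start " + u); }
            public String getStatus(String u) { return "DONE"; }
            public void stop(String u) { System.out.println("stop " + u); }
        };
        List<String> chain = Arrays.asList(
                "http://example0.com/dataloader/service/",
                "http://example1.com/transformation/service/");
        System.out.println(run(chain, fake, 10));
    }
}
```

Keeping each task behind a service URL means the coordinator does not need to know whether a task runs on Hadoop or as local processing.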
  37. Web Service: Start MapReduce
      Job job = getMapReduceJob();
      job.waitForCompletion(true);
      OR
      job.submit();
  38. Web Service: MapReduce monitoring
      job.isComplete();
      job.isSuccessful();
      job.mapProgress();
      job.reduceProgress();
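
As a sketch of how the four monitoring calls could feed a "GET STATUS of TASK" response, the example below turns them into a small JSON status string. To stay runnable without Hadoop on the classpath it uses a hypothetical `JobHandle` interface that is assumed to mirror the four `Job` methods named on slide 38; the JSON field names and the state strings are also assumptions.

```java
import java.io.IOException;
import java.util.Locale;

// Hypothetical sketch; a real service would wrap org.apache.hadoop.mapreduce.Job.
final class JobStatusSketch {

    // Stand-in for the four monitoring calls shown on the slide.
    interface JobHandle {
        boolean isComplete() throws IOException;
        boolean isSuccessful() throws IOException;
        float mapProgress() throws IOException;
        float reduceProgress() throws IOException;
    }

    // Builds a status DTO as JSON; progress values are fractions between 0 and 1.
    static String statusJson(JobHandle job) throws IOException {
        String state = !job.isComplete() ? "RUNNING"
                : (job.isSuccessful() ? "DONE" : "FAILED");
        return String.format(Locale.ROOT,
                "{\"state\":\"%s\",\"mapProgress\":%.2f,\"reduceProgress\":%.2f}",
                state, job.mapProgress(), job.reduceProgress());
    }

    public static void main(String[] args) throws IOException {
        // Fake job: map phase finished, reduce phase half done.
        JobHandle running = new JobHandle() {
            public boolean isComplete() { return false; }
            public boolean isSuccessful() { return false; }
            public float mapProgress() { return 1.0f; }
            public float reduceProgress() { return 0.5f; }
        };
        System.out.println(statusJson(running));
    }
}
```

This polling style only makes sense when the job was started with `job.submit()`, which returns immediately, whereas `job.waitForCompletion(true)` blocks the calling thread until the job ends.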
  39. Web Service: MapReduce & Third Party Libraries
      System.setProperty("path.separator", ":");
      Configuration config = getConfig();
      FileSystem fileSystem = getFS();
      fileSystem.copyFromLocalFile(source, destination);
      DistributedCache.addArchiveToClassPath(destination, config, fileSystem);
  40. Web Service: MapReduce & Third Party Libraries
      Job configuration file on cluster will contain:
      mapred.cache.archives = hdfs://namenode.com:9000/distributedCache/gson-1.7.1.jar,...
      mapred.job.classpath.archives = /distributedCache/gson-1.7.1.jar:...
  41. Reporting Application
  42. Dashboard
  43. Get Report (REST API); Create Report (REST API); Reports Data Storage Facade; MongoDB cluster
  44. http://spf13.com/post/mongodb-and-hadoop
  45. MongoDB cluster
      CLIENT_546 collection:
      ...
      {"report_type":"PROD COUNT", "snapshot_time":"2011-01-15", "AMAZON":"500", "TARGET":"300", "WALMART":"900"}
      ...
      {"report_type":"PRICE COMPARE", "MAS_PROD_ID":"72312", "snapshot_time":"2011-02-15", "NAME":"Womens Bootcut Jean", "MFGR_NAME":"WRANGLER", "AMAZON_PRICE":"50", "TARGET_PRICE":"45", "WALMART_PRICE":"55"}
      ...
  46. UI For Client: JSON query; JSON report; Reports Data Storage Facade
  47. Hadoop: Reducer, Reducer; JSON report row; Reports Data Storage Facade
  48. Contacts
      Alexey Gayduk
      gayduk.a.s.ua@gmail.com
      oleksiy_gayduk
      http://www.linkedin.com/pub/alexey-gayduk/4/39b/a31
      Roman Nikolaenko
      sage.nrs@gmail.com
      roman_jd_nikolaenko
      http://ua.linkedin.com/pub/roman-nikolaienko/2b/413/431
