Leveraging Hadoop to mine customer insights in a developing market


I was a speaker at the Big Data World conference in London on 18 September 2012.
The full text of the talk is at http://webkpis.com/2012/11/hadoop-implementation-in-wikimart/

• Incorporating Hadoop technology within your infrastructure to cut costs and increase the scale of your operations
• Understanding how Hadoop can provide insightful data analysis to the end user
• Combining Hadoop with existing enterprise systems to deepen your insight and discover previously hidden trends
• Will Hadoop replace the need for relational data warehousing systems?



  1. Leveraging Hadoop in Wikimart
     Roman Zykov, Head of Analytics, http://wikimart.ru
     London, Big Data World Europe, 20 September 2012
  2. Introduction
     Key problem: to be, or not to be… Hadoop
  3. Key tasks for Wikimart
     What:
     • BI tasks
     • Web analytics (in-house solution)
     • Recommendations on site
     • Data services for marketing
     Who:
     • Core analytics team
     • Analytics members in other departments
     • IT site operations
  4. Problem
     Too time-consuming or too expensive?
     • Data volume
     • Number of data services
  5. MapReduce, standalone (diagram: DATA → Map → Reduce)
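The DATA → Map → Reduce flow on this slide can be sketched as a toy in-process version in Python. This is illustrative only (Hadoop runs the same three phases distributed across a cluster); the product-view records and field names are invented for the example:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (key, 1) pair for every product-view event."""
    for user, product in records:
        yield product, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Toy input: (user, viewed product) events
views = [("u1", "tv"), ("u2", "tv"), ("u1", "phone")]
counts = reduce_phase(shuffle(map_phase(views)))
# counts == {"tv": 2, "phone": 1}
```

The same pipeline shape underlies the Pig and HiveQL jobs mentioned later in the deck; those languages compile down to these map/shuffle/reduce phases.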
  6. Our idea
     A new platform for "Big Data" tasks only:
     • Start research on MapReduce software
     • First patient: the recommendation engine
     Difficulties:
     • No planned budget → Hadoop is free
     • No experts → learn it
     • No hardware → virtual cluster
  7. Requirements for Hadoop
     • Easily scalable
     • Easy deployment
     • Easy integration
     • Less low-level Java coding
     • SQL-like queries
  8. Data flow (diagram: DWH and data feeds)
  9. Accomplishments
     Recommendations:
     • Collaborative filtering (item-to-item on browsing history, Pig)
     • Similar products (item attributes, Pig)
     • Most popular items (browsing history + orders, HiveQL)
     • Internal and external search recommendations (HiveQL)
     Some statistics after 1 year:
     • >10% of revenue
     • 3 months to launch
     • Tens of gigabytes processed in 2 hours daily
     • Only 1 crash (cluster lost power)
     Decision: invest in a hardware cluster
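The item-to-item collaborative filtering above ran as a Pig job over browsing history. A minimal Python sketch of the underlying co-occurrence idea (the product names and the alphabetical tie-break are my own, for illustration; the production Pig script is not shown in the deck):

```python
from collections import defaultdict
from itertools import combinations

def item_to_item(histories):
    """Count how often two products occur in the same browsing history.
    High co-occurrence becomes "people who viewed X also viewed Y"."""
    cooc = defaultdict(int)
    for products in histories:
        for a, b in combinations(sorted(set(products)), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1
    return cooc

def recommend(cooc, product, top_n=3):
    """Rank other products by co-occurrence with the given one
    (ties broken alphabetically for determinism)."""
    scores = [(other, n) for (p, other), n in cooc.items() if p == product]
    scores.sort(key=lambda x: (-x[1], x[0]))
    return [other for other, _ in scores[:top_n]]

histories = [["tv", "phone"], ["tv", "phone", "tablet"], ["tv", "tablet"]]
cooc = item_to_item(histories)
# recommend(cooc, "phone") -> ["tv", "tablet"]
```

On the real data this pairwise counting is exactly the kind of job that is "too time consuming or too expensive" on a single machine, which is why it was the first workload moved to Hadoop.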
  10. End user
     Internal high-level languages:
     • HiveQL
     • Pig
     Reporting:
     • Pre-aggregated data for OLAP
     • RDBMS as front end
     • OLAP and reporting software should support HiveQL
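"Pre-aggregated data for OLAP" means the heavy rollup runs in Hadoop and only small summaries land in the RDBMS front end. A toy Python sketch of such a rollup, assuming invented event fields (day, category, revenue) purely for illustration:

```python
from collections import defaultdict

# Raw order events as they might arrive from site logs
# (field names are invented for this example).
events = [
    {"day": "2012-09-20", "category": "tv", "revenue": 300},
    {"day": "2012-09-20", "category": "tv", "revenue": 200},
    {"day": "2012-09-20", "category": "phone", "revenue": 150},
]

def rollup(events):
    """Aggregate raw events into (day, category) totals: the small
    summary table an OLAP/reporting front end would actually load."""
    totals = defaultdict(lambda: {"orders": 0, "revenue": 0})
    for e in events:
        key = (e["day"], e["category"])
        totals[key]["orders"] += 1
        totals[key]["revenue"] += e["revenue"]
    return dict(totals)

summary = rollup(events)
# summary[("2012-09-20", "tv")] == {"orders": 2, "revenue": 500}
```

In the architecture described here, the equivalent of `rollup` would be a HiveQL GROUP BY over the full log, with the resulting summary exported to the RDBMS front end.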
  11. Data integration
     • Sqoop:
       • Parallel data exchange with RDBMS (MS SQL, MySQL, Oracle, Teradata…)
       • Incremental updates
       • HDFS, Hive, HBase
     • Talend Open Studio
  12. Hadoop vs RDBMS
     • Will never replace the RDBMS:
       • Latency
       • Weak capabilities of HiveQL vs SQL
     • Only some tasks with offline processing:
       • Machine learning
       • Queries over big tables
       • …
     • Real time: NoSQL
  13. Hadoop myth
     Terabytes? Petabytes? Big tasks!
  14. Conclusion
     • Hadoop is not rocket science
     • Intermediate data can be Big Data
     Starter kit:
     • Hadoop management system
     • Virtual hardware (cloud, virtual servers, etc.)
     • Offline data tasks
     • Pig or HiveQL
     • Sqoop: import data from existing data sources
  15. Thank you!
     rzykov@gmail.com
     linkedin.com/in/romanzykov