Hidden Gems found with Hadoop

How does using Hadoop in the cloud for data analytics fit into the context of continuous deployment? Also, a look at how one can use CAP to match data access patterns with appropriate data frameworks.

Speaker notes:
  • Slide 1:
    • prior teams: Jive, ShareThis, Adknowledge, HeadCase
    • worked with Ray while at ShareThis on our DW and recommender systems
    • 5 years of experience with AWS, some at firms 100% in the cloud
  • Slide 2:
    • I’m a big believer in asking many questions up-front…
    • this talk examines how Hadoop fits into what IMVU is famous for: continuous deployment
    • we do critical work with large data sets, which makes an RDBMS a poor fit
  • Slide 3:
    • CD allows many developers to respond to immediate needs and to experiment frequently
    • transparency, measurement, and consistent data-driven decisions are absolutely requisite
  • Slide 4:
    • in short, we can handle in minutes or hours what other firms might take days, weeks, or months to do
    • decisions and actions are highly distributed, and the engineering process is well disciplined
  • Slide 5:
    • my team works in Analytics, and our data usage is at a different priority than our production cluster
    • this is generally true throughout the industry
    • business strategy depends on analytics – however, analytics work tends to break what we’ve so carefully instrumented
  • Slide 6:
    • how did we reach this condition?
    • 1997Q4 through 1998Q1, AMZN/EBAY/GOOG/YHOO redefined data use
    • revenue/data size, as a metric, fell through the floor
    • previous practices in relational DBs and BI no longer worked so well
  • Slide 7:
    • the CAP theorem explains an inherent conflict there…
    • Internet transactions tend to need different kinds of data management than analytics
    • partitioned databases are a solution for one aspect, but in turn cause ETL to become a huge problem
  • Slide 8:
    • fortunately, there are patterns we can use to engineer around those conflicts…
    • provided that you don’t buy into “one size fits all” sales rhetoric from DB vendors
    • design patterns help here: choose data frameworks which fit your data access patterns
  • Slide 9:
    • hopefully, this table states the CAP forfeits correctly – email me corrections, please :)
    • some of these patterns migrate well to the cloud; you may miss a big opportunity if you don’t
  • Slide 10:
    • Redis is notable; its rich, flexible atomic operations lend themselves to non-sharded cases
    • let’s drill down into the Hadoop use cases…
  • Slide 11:
    • here are a variety of kinds of data preparation, discovery, modeling, and visualization for which my teams have used Hadoop and AWS
    • generally the goal is to automate almost all of the work, as “pipelines”, and deliver data products/data services
    • these visualizations are actually some recent products from my team (less a few details stripped out)…
    • geolocation, topic trending from text analytics, measuring effects across the social graph, and comparing features vs. retention
  • Slide 12:
    • BTW, Redis provides an excellent “left brain” to pair with the Hadoop “right brain”
    • this is not strictly “real-time” analytics, but it is cost-effective and follows guidance from CAP
    • in other words, scalable data frameworks based on prevalent data access patterns
  • Slide 13:
    • here is some further reading, which I will post online…
  • Slide 14:
    • oh, and yes, we are hiring :)
Transcript: Hidden Gems found with Hadoop

    1. Hidden Gems found with Hadoop
       Paco Nathan
       Lead, Analytics team @ IMVU.com
    2. Ask Questions Early…
       ‣ How do Hadoop and “Big Data” fit into the practice of Continuous Deployment?
       ‣ Why don’t we simply load all our data into Oracle, then generate reports and spreadsheets as needed?
       ‣ Given all the conflicting “NoSQL” options, how does an engineer design an effective data store?
       ‣ Is there one framework we can just buy and resolve all these annoying data issues?
       ‣ What kinds of analytics work can be performed using Hadoop in the cloud?
       ‣ Is IMVU currently hiring? ☺
    3. Continuous Deployment
       • IMVU: ~50 engineers work in parallel, builds push live every ~8 minutes
       • depends on “immune system” regression checks, progressive roll-outs (sketched after this slide)
       • dedication to transparency and metrics: data-intensive company culture
       • extensive use of customer experiments (A/B testing) on millions of users
       • instrumentation, alerting, strict discipline on config and resource usage
       • Ops excellence, plus big investment in a finely tuned production environment
       http://www.quora.com/What-are-best-examples-of-companies-using-continuous-deployment
       http://www.slideshare.net/bgdurrett/3-reasons-you-should-use-continuous-deployment
       http://www.startuplessonslearned.com/2009/06/why-continuous-deployment.html
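A minimal sketch of the “immune system” idea from slide 3, assuming a monitoring store you can query: after a deploy settles, compare a few health metrics against fixed limits and signal a rollback on any regression. The metric names, limits, and fetch_metric stub are hypothetical, not IMVU’s actual implementation.

    import time

    # Hypothetical post-deploy health checks: metric name -> maximum healthy value.
    MAX_THRESHOLDS = {
        "error_rate_pct": 1.0,    # percentage of HTTP 5xx responses
        "p95_latency_ms": 500.0,  # 95th-percentile page latency
        "slow_query_rate": 0.5,   # slow DB queries per second
    }

    def fetch_metric(name):
        """Stub: a real system would query its monitoring stack here."""
        sample = {"error_rate_pct": 0.4, "p95_latency_ms": 310.0, "slow_query_rate": 0.1}
        return sample[name]

    def deploy_is_healthy(settle_seconds=120):
        """Let the new build settle, then verify every metric is under its limit."""
        time.sleep(settle_seconds)
        for name, limit in MAX_THRESHOLDS.items():
            value = fetch_metric(name)
            if value > limit:
                print("regression: %s = %.2f exceeds %.2f -> roll back" % (name, value, limit))
                return False
        return True

    if __name__ == "__main__":
        print("healthy" if deploy_is_healthy(settle_seconds=0) else "roll back")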
    4. Continuous Deployment
    5. Data Analytics
       • data usage downstream from the production cluster is a lower priority
       • industry truism: data usage downstream almost never trumps the priority of direct revenue transactions
       • even so, business strategy depends on data analytics – which in practice, at scale, must live downstream from transactions
       • however, data analytics jobs tend to break the extensive testing/monitoring work that allows for continuous deployment:
         - mission-critical code which can’t be verified readily by unit tests
         - “slow queries” trip the immune system, signaling regressions
         - likewise for large data transfers within the production cluster
         - a tightly configured environment vs. elastic resource needs
    6. How Did We Get Here?
       • big Internet successes after the 1997 holiday season… AMZN, EBAY, then GOOG, Inktomi (YHOO Search)
       • consider how, among tech firms, this metric:
             annual revenue per customer / operational data store size
         dropped more than 100x within a few years after 1997
       • the “conventional wisdom” of RDBMS and BI tools became much less viable; however, the business cadre which came of age when “spreadsheets were new” tends to carry too much inertia to confront these issues pro-actively
       • on one hand, storage and processing costs plummeted… on the other hand, we must now work much smarter to extract ROI from “Big Data”, so methods must adapt
       • MapReduce and the Hadoop open source stack grew directly out of this context… but they only solve part of these problems
    7. CAP Theorem
       • Eric Brewer, 2000: “You can have at most two of these properties for any shared-data system … the choice of which feature to discard determines the nature of your system.”
       • direct revenue apps in consumer Internet require consistency and partition tolerance
       • data analytics jobs for business uses generally require availability and eventual consistency, but tend not to tolerate highly partitioned data
       • ETL becomes an Achilles heel for the “Lean Startup™”:
         ‣ agile/experiment-driven/scale-out, which leads to…
         ‣ provably-hard-to-detect metadata drift, which leads to…
         ‣ high-risk technical debt
       [diagram: CAP triangle with vertices strong consistency (C), high availability (A), partition tolerance (P); RDBMS sits on the C–A edge, eventual consistency on the A–P edge]
       https://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
       http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
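One place the CAP trade-off shows up concretely is in quorum-replicated stores of the Dynamo lineage (behind several of the “NoSQL” options discussed later): with N replicas, a read quorum R, and a write quorum W, reads are guaranteed to see the latest write only when R + W > N. A small sketch of that arithmetic, not tied to any particular product:

    def quorum_consistency(n_replicas, read_quorum, write_quorum):
        """Dynamo-style quorum rule: a read and a write are guaranteed to
        overlap on at least one replica only when R + W > N."""
        overlap = read_quorum + write_quorum > n_replicas
        return "strongly consistent" if overlap else "eventually consistent"

    # Tuning R and W trades consistency against availability/latency:
    print(quorum_consistency(3, 2, 2))  # R+W = 4 > N = 3 -> strongly consistent
    print(quorum_consistency(3, 1, 1))  # R+W = 2 <= N = 3 -> eventually consistent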
    8. Data Access Patterns
       • design patterns: originated in consensus negotiation for architecture, then software engineering
       • consider the corollaries in large-scale data wrangling…
       • essential advice: select data frameworks based on your data access patterns
       • in other words, decouple usage based on need – to avoid “one size fits all” blockers
       • let’s review some examples…
    9. Access Patterns ↔ Frameworks
       (C = consistency, A = availability, P = partition tolerance; x marks a forfeited property)

       access pattern                      framework                          C A P
       financial transactions              general ledger in RDBMS            C A x
       ad-hoc queries                      RDS (hosted MySQL)                 C A x
       reporting, dashboards               like Pentaho                       C A x
       log rotation/persistence            like Riak                          x x P
       search indexes                      like Lucene, Solr                  x A P
       static content, archives            S3 (durable storage)               x A P
       customer facts                      like Redis, Membase                x A P
       distributed counters, locks, sets   like Redis                         x A P*
       data objects CRUD                   key/value – like NoSQL on MySQL    C x P
       authoritative metadata              like Zookeeper                     C x P
       data prep, modeling at scale        like Hadoop/Hive/Cascading + R     C x P
       graph analysis                      like Hadoop + Redis + Gephi        C x P
       data marts                          like Hadoop/Hive/HBase             C x P
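To make one row above concrete: “distributed counters, locks, sets” maps to Redis because its atomic single-threaded commands (INCR, SADD, SETNX) avoid read-modify-write races without an RDBMS. A short sketch using the redis-py client; all key names are illustrative.

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # distributed counter: INCR is atomic, so concurrent clients never lose updates
    views = r.incr("counters:homepage_views")

    # distributed set: count unique visitors without a read-modify-write race
    r.sadd("visitors:2011-06-14", "user:12345")
    uniques = r.scard("visitors:2011-06-14")

    # simple lock: SETNX succeeds for exactly one client; EXPIRE avoids deadlock
    if r.setnx("locks:nightly_rollup", "worker-7"):
        r.expire("locks:nightly_rollup", 300)  # auto-release after 5 minutes
        try:
            pass  # ... exclusive work goes here ...
        finally:
            r.delete("locks:nightly_rollup")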
    10. Access Patterns ↔ Frameworks
        [repeats the table from slide 9 – per the speaker notes, calling out the Redis rows]
    11. Data Prep → Modeling at Scale
        Analytics jobs performed in the cloud with Hadoop, R, etc.:
        • log clean-up, sessionization (sketched after this slide)
        • roll-ups, slices, sampling, data cubes, visualizations
        • language identification, key phrase extraction
        • co-occurrence analysis, topic trending
        • custom search indexes
        • random forests and other classifiers
        • connected components, effects across social graph
        • virtual economy metrics
        Business use cases:
        • customer segmentation
        • retention models
        • anti-fraud
        • content recommendation
        • ad optimization
        [figures: sample visualizations from the team – geolocation, topic trending from text analytics, effects across the social graph, features vs. retention]
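The first bullet above, log clean-up and sessionization, is a canonical Hadoop streaming job: the mapper keys each log line by user so the shuffle groups a user’s events together, and the reducer splits each user’s time-ordered events into sessions wherever the gap exceeds a timeout. A minimal sketch; the tab-separated log format and the 30-minute timeout are assumptions, not IMVU’s actual pipeline.

    #!/usr/bin/env python
    # Sessionization as a Hadoop streaming job (mapper + reducer in one file).
    # Assumed input: tab-separated lines of "user_id <TAB> epoch_ts <TAB> event".
    import sys

    SESSION_TIMEOUT = 30 * 60  # a 30-minute gap ends a session (an assumption)

    def mapper():
        # emit user_id as the key so the shuffle groups each user's events
        for line in sys.stdin:
            user, ts, event = line.rstrip("\n").split("\t")[:3]
            print("%s\t%s\t%s" % (user, ts, event))

    def flush(user, events):
        # order one user's events by time, then split on long gaps
        if not user:
            return
        events.sort()
        session_id, last_ts = 0, None
        for ts, event in events:
            if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
                session_id += 1  # long gap: start a new session
            print("%s\t%d\t%d\t%s" % (user, session_id, ts, event))
            last_ts = ts

    def reducer():
        # streaming delivers lines sorted by key; buffer one user at a time
        current_user, events = None, []
        for line in sys.stdin:
            user, ts, event = line.rstrip("\n").split("\t")[:3]
            if user != current_user:
                flush(current_user, events)
                current_user, events = user, []
            events.append((int(ts), event))
        flush(current_user, events)

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()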
    12. Finding Hidden Gems…
        [pipeline diagram, approximately: data objects and transactions in MySQL partitions flow through ETL into S3; cloud-based Hadoop jobs feed data marts and frameworks matched to access patterns – Hive, RDS, Lucene/Solr, a Redis cache, Gephi, R – which serve the business use cases: reporting, ad-hoc queries, search, recommenders, data services, sessionization, graph and social-graph analysis, predictive modeling, factor analysis, time series, data visualization]
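Tying the diagram together: the Hadoop stage runs in the cloud on Amazon Elastic MapReduce (see the EMR Developer Guide under Related Resources). A hedged sketch of launching the sessionization job above with the boto Python client of that era; the bucket names, paths, and job parameters are placeholders.

    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep

    conn = EmrConnection()  # AWS credentials come from the environment/boto config

    step = StreamingStep(
        name="sessionize logs",
        mapper="python sessionize.py map",
        reducer="python sessionize.py reduce",
        input="s3n://example-bucket/logs/2011-06-14/",
        output="s3n://example-bucket/sessions/2011-06-14/",
        # ship the script to every node via the distributed cache
        cache_files=["s3n://example-bucket/scripts/sessionize.py#sessionize.py"],
    )

    jobflow_id = conn.run_jobflow(
        name="nightly sessionization",
        log_uri="s3n://example-bucket/emr-logs/",
        steps=[step],
        num_instances=4,
        master_instance_type="m1.small",
        slave_instance_type="m1.small",
    )
    print("started job flow: %s" % jobflow_id)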
    13. Related Resources
        http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/
        http://www.slideshare.net/pacoid/getting-started-on-hadoop
        https://github.com/ceteri/ceteri-mapred
        http://redis.io/
        http://www.r-project.org/
        http://gephi.org/
    14. Analytics Team, IMVU.com
        • IMVU: 90 employees in the Bay Area, $40MM annual revenue
        • largest virtual goods catalog: 6MM+ items, all user-generated content (UGC)
        - Best Places to Work in Bay Area, 2011 & 2010
        - Red Herring Global 100 Tech Startup, 2010
        - Inc. 500, 2010
        http://www.imvu.com/jobs/
        @pacoid
