Big data arch_analytics
  • Good morning! My name is
  • I am Srinu Adira, and I manage Business Solutions at LinkedIn. I am primarily responsible for providing and enabling data solutions for iterative business decisions. Today I am going to talk about the big data ecosystem at LinkedIn. We manage hundreds of terabytes of data by leveraging scalable infrastructure.
  • LinkedIn’s mission is to connect the world’s professionals to make them more productive and successful. There are a total of 640M professionals in the world. As of last week, we have 200M registered members from all over the world. LinkedIn operates the world’s largest professional network on the Internet, in over 200 countries and territories.
  • LinkedIn believes in connecting talent with opportunity at massive scale. LinkedIn relies heavily on scalable infrastructure to track user behavior. We believe in transforming the way the world works.
  • LinkedIn’s member base grew gradually during the initial years. For the past 3 years, growth has been phenomenal. Yes, we have reached the 200M mark. The current rate is approximately 2 new members per second, and over 2M companies have a presence on LinkedIn. More than 80% of Fortune 100 companies use LinkedIn to hire professionals. Approximately 5.7B searches were done on LinkedIn in 2012. The bottom line is that we are growing fast and expanding the professional network.
  • Why is data important for LinkedIn? Data is everywhere, and LinkedIn believes in data.
  • Our founder, Reid Hoffman, believes in constant iteration on product improvements. Our site reflects the same belief as we keep adding new features.
  • Of course, our SVP is a firm believer in data: to fix something, we need to measure it. That’s where data plays a critical role. To extend it further, what gets measured gets improved. What gets analyzed gets monetized.
  • How does LinkedIn’s network impact our members? LinkedIn analyzes large amounts of data on a daily basis. This analysis results in relevant and valuable products, business solutions and services. These improvements in products and services are reflected in member growth and engagement, which in turn generate more data. The cycle of improvements continues.
  • Who are the main drivers behind these solutions? Our business analytics teams leverage these solutions to measure, analyze and predict growth. Our sales analytics teams use the results of these analyses to improve sales cycles. Our marketing teams use them to fine-tune their campaigns, such as emails. Talent Connect leverages them to match jobs to members and vice versa. Last but not least, our business operations teams leverage these solutions to assess the pulse of our business.
  • To enable business decisions, data insights provide much-needed analysis. For example, company-level details such as total members, connections and employees viewed give a good overview of growth and engagement on LinkedIn for a given company. As another example, the company cloud shows employee inflow and outflow. This kind of analysis equips our business with the data input it needs to improve further.
  • What kind of solutions do we generate? Looking at this cycle, as step 1 we focus on segmenting our members with the help of standardization. This segmented data can be used for propensity modeling to fine-tune our target audience. In turn, this model helps us target our member base with a richer usage experience. Besides targeting, we leverage data for business forecasting to help our sales teams. We also analyze churn and calculate the lifetime value of a customer to improve our customer base. For our solutions, we leverage tools and technology extensively. A few of them are MPP systems such as Teradata and Aster, and distributed systems such as Hadoop, for storage and processing. Besides MPP and Hadoop, Java and machine learning technologies are used for various processes.
  • Now I will spend some time going over our big data ecosystem.
  • We broadly divide our data challenges into three dimensions: volume, variety and velocity. This chart is sourced from TDWI, courtesy of Philip Russom. About volume: we process terabytes of data in the form of records, transactions, tables and files. About variety: we process various kinds of data, such as structured data (database tables), unstructured data (such as some tracking data) and semi-structured data. About velocity: our infrastructure is developed to incorporate various kinds of data streams, such as batch (files), near-time (tracking data) and real-time (transactional data), besides streams.
  • To accommodate accelerating volumes and increasing variety and velocity, we are building platforms and solutions that can scale, simplify and enable business decisions.
  • ERP data: transactional data, information. CRM: marketing, campaigns, usage, engagement. Web: engagement, pathing, social data. What happened? (BI and reporting.) Analyses. Real-time monitoring of the key business trends. Predictive analyses. Teradata too small.
  • 3 major dimensions of data empower analytics. Behavioral data: site engagement, OL transactions, searches, navigation paths, RFM, comments, discussions, … Demographic data: location, gender, title, function, seniority, education, … Social data: connections, co-views, sentiment tracking, NPS, follows, endorsements, forwards, comments, shares, …
  • In the high-level data flow architecture, user interaction with the application generates various data sets, such as near-line lookup data and an online data store that maintains user transactions such as profile information. Offline data is also generated in the form of web logs. In turn, all these data sets are centralized in the offline data store. As you can see from this high-level data flow, no single tool or technology can handle these needs. Hence, we had to build our own combination of tools and technologies to meet specific requirements.
  • What do we use for our data stack? As you can see from this slide, we use a mix of commercial tools such as Teradata and Oracle and open source technologies such as Hadoop, Kafka and Voldemort. In the next slides we will look into where and how these tools are used.
  • LinkedIn leverages, builds and contributes to open source tools. Transactional data such as member profile data is maintained in Oracle and Espresso.
  • For nearline, LinkedIn leverages Voldemort as a distributed key-value store, whereas D-Graph is used as a distributed graph engine.
  • For pipelines, we leverage Kafka and Databus to transport data from online stores and web logs to the offline data store.
  • For data analysis and reporting, we leverage Hadoop and Teradata systems. Both are used to process large data sets to enable machine learning and analytics.
  • Many companies are stumbling blindly into social media marketing without a measurement strategy. Measurement is king. The first new-model email campaign was launched using half of the regular campaign volume. Within 2 hours, it triggered the NOC alerting system because the order volume had doubled on a week-over-week basis. Within 9 hours, the new model had already surpassed the old model, which had been launched 14 days earlier on new sub acquisitions with a 7-day reminder email. Within 7 days, we saw a 300+% lift from the Gen model and a 480+% lift from the Sales model. The email open rate of the new model outperformed that of the old model. So far, we have not seen any increase in the opt-out rate. In fact, we have observed a slight decrease in the opt-out rate at a high level.
  • 2M companies. We have thousands of salespeople. How do we prioritize sales? We predict which account will spend, and how much. Within the account, we predict who can make the decision; it is a mix of behaviors and engagement. Within LinkedIn, we predict who has the highest likelihood of closing the deal and who can be leveraged to close it.
  • 4 principles at LinkedIn
  • 4 phases of analytics. We’d like to predict the future.
  • As our CEO says, always look for the next play! Yes! Web 3.0 is all about data!
  • Thank you all! We are growing and need more professionals! We are hiring! Please reach out to me if you have questions!
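
The email campaign note above quotes results as percentage lift (300+% and 480+%). The talk does not say how those figures were computed, so the following is only a minimal sketch of the usual relative-lift arithmetic. The function name `percent_lift` and the order counts are hypothetical and are not taken from the presentation.

```python
def percent_lift(baseline: float, variant: float) -> float:
    """Relative lift of `variant` over `baseline`, expressed in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (variant - baseline) / baseline * 100.0


# Hypothetical order counts, chosen only to illustrate the arithmetic.
old_model_orders = 100
new_model_orders = 400

print(percent_lift(old_model_orders, new_model_orders))  # 300.0
```

Under this definition a 300% lift means the new model produced four times the baseline, not three times.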

Big data arch_analytics: Presentation Transcript

  • Big Data EcoSystem and Analytics @ LinkedIn. May 16, 2013. LinkedIn Confidential ©2013 All Rights Reserved
  • Srinu Adira, Manager, Data Services (Business Solutions), LinkedIn Corporation. http://www.linkedin.com/in/srinuadira
  • Outline: LinkedIn Overview; Why Data is important for LinkedIn?; Big Data Ecosystem; Analytics at LinkedIn
  • Our Mission: Connect the world’s professionals to make them more productive and successful
  • The LinkedIn Opportunity: Connect talent with opportunity at massive scale + Fundamentally transforming the way the world works
  • 200M+: The World’s Largest Professional Network. [Chart: LinkedIn Members (Millions), 2006 to 2012] 88% of Fortune 100 Companies use LinkedIn to hire; ~2/sec New Members joining; >2.9M Company Pages; ~5.7B Professional searches in 2012
  • Outline LinkedIn Overview Why Data is important for LinkedIn? Big Data Ecosystem Analytics at LinkedInLinkedIn Confidential ©2013 All Rights Reserved 7
  • LinkedIn Confidential ©2013 All Rights Reserved 8“If you are not embarrassed by the first versionof your product, you have launched it too late.”Reid Hoffman, Founder & Chairman LinkedIn Corp
  • LinkedIn Confidential ©2013 All Rights Reserved 9“What gets measured gets fixed.”David Henke, SVP Technology Operations, LinkedIn Corp
  • LinkedIn Confidential ©2013 All Rights Reserved 10The Power of LinkedIn’s Network EffectsMember growthand engagementRelevant andvaluable products,solutions & servicesCritical massof data
  • A Few Data Driven Products: People You May Know; Groups You May Like; Jobs You May Be Interested In; Who’s Viewed Your Profile; Companies You May Want To Follow
  • Data Insights (Sample)
  • Data Solutions (Sample). [Diagram labels: Java/MPP/Hadoop; ML/Statistical Packages; Hadoop; MPP; MPP]
  • Data Solutions Drivers: Business analytics (e.g., data mining, enable decision making); Sales analytics (e.g., customer segmentation, targeting); Marketing (e.g., campaigns); Data insights for Customers (e.g., Career site analytics); Business Operations (forecasting, business pulse)
  • Outline: LinkedIn Overview; Why Data is important at LinkedIn?; Big Data Ecosystem; Analytics at LinkedIn
  • Big Data at LinkedIn. * Chart from Philip Russom, Research Director, TDWI
  • Big Data at LinkedIn. Platform and solutions that: Scale at cost with data complexity; Simplify the data continuum across online, near-line and offline; Enable business decisions
  • What does “big data” mean at LinkedIn? [Chart: Data Volume against Analytical Challenge & Complexity, both from 0 to +∞, with ERP data, Social Data, CRM data and Web data]
  • 3 major data dimensions at LinkedIn: Identity Data; Social Data; Behavioral Data
  • Big Data at LinkedIn: High-level data environment. [Diagram: Users, Application, Near-Line Data Store, Online Data Store, Web Logs, Offline Data Store] Challenges so complex that off-the-shelf or a few technologies can’t address them. Built our own combination of toolsets/technologies to meet specific requirements.
  • LinkedIn’s Sample Data Stack: Let’s do a deep dive to understand how the capabilities of LinkedIn’s data stack meet our requirements
  • LinkedIn Data Stack – Online. [Diagram: Users, Application, Near-Line Data Store, Online Data Store, Web Logs, Offline Data Store] Capabilities: Rich structures (e.g., indexes); Change capture capability
  • LinkedIn Data Stack – Nearline. [Diagram: Users, Application, Near-Line Data Store, Online Data Store, Web Logs, Offline Data Store] Systems: Bobo, Sensei, Voldemort, Zoie, D-Graph. Capabilities: Distributed key-value store; Search platform; Distributed graph engine
  • LinkedIn Data Stack – Pipeline. [Diagram: Users, Application, Near-Line Data Store, Online Data Store, Web Logs, Offline Data Store] Capabilities: Messaging for site events, monitoring; Change data capture streams; Reliable, consistent, low latency pipe
  • LinkedIn Data Stack – Offline. [Diagram: Users, Application, Near-Line Data Store, Online Data Store, Web Logs, Offline Data Store] Capabilities: Machine learning, ranking, relevance, solutions; Warehouse and analytics
  • LinkedIn with Hadoop, Aster, and Teradata. Connectors: Aster/Teradata Bi-Directional Connector; Aster/Teradata Hadoop Connectors. Data transformation & batch processing (Image processing; Search indexes; Graph (PYMK); MapReduce): Batch data transformations for engineering groups using HDFS + MapReduce. Analytic Platform for data discovery (nPath Pattern/Path; Clickstream analysis; A/B site testing; Data Sciences discovery; SQL-MapReduce): Interactive MapReduce analytics for the enterprise using MapReduce Analytics & SQL-MapReduce. Integrated Data Warehouse (Exec Dashboards; Adhoc/OLAP; Complex SQL; SQL): Integration with structured data, operational intelligence, scalable distribution of analytics.
  • Outline LinkedIn Overview Why Data is important at LinkedIn? Big Data Ecosystem Analytics at LinkedInLinkedIn Confidential ©2013 All Rights Reserved 27
  • Several examples of businessanalytics evolution at LinkedInProductsMarketingSales12328
  • How we leverage data to support Marketing29Identity DataSocial DataBehavioral DataOverallAudienceTargetAudienceLinkedIn Confidential ©2013 All Rights Reserved
  • The closed-loop analytical framework30ExecutionReporting &businessintelligencePost campaignanalysisModel buildingand tuningCampaignplanning & designTestMeasureWhy?PredictDesignLinkedIn Confidential ©2013 All Rights Reserved
  • A example of using data to improve salesWhich account? Who? How?Step 1 Step 2 Step 3Identity DataSocial DataBehavioral Data31
  • How to provide 500 to 1000X impact?Insights portal for sales org.Easy: quickly find right infoFast: few seconds response time formost insightsScalable: 2M+ accounts/prospectsAccurate: mimic analyst/data scientist123432
  • Four stages of data analyticsWhat will happen?What happened?Why it happened?What is happening?HighHighBusinessValueAnalytical Challenge & Complexity033LinkedIn Confidential ©2013 All Rights Reserved
  • Use data to solve product problems-- A solution for answering A/B testing questionsLet technology work for usResults first, methodology laterBypass the charts and reportsSeveral thousands A/B tests are live, howto measure the performance?12334LinkedIn Confidential ©2013 All Rights Reserved
  • Nextplay : Web 3.0 – It’s all about data!!LinkedIn Confidential ©2013 All Rights Reserved 35
  • We are hiring!Thank you!36sadira@linkedin.comLinkedIn Confidential ©2013 All Rights Reserved
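
The A/B testing slide above asks how to measure performance when thousands of tests are live, but the transcript does not describe the method used. As a generic illustration only, and not LinkedIn's approach, one common way to compare two conversion rates is a pooled two-proportion z-test. The function name `two_proportion_z_test` and all counts below are hypothetical.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Pooled two-proportion z-test.

    Returns (z, two_sided_p_value) for the difference between the
    conversion rate of variant B and that of variant A.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("standard error is zero; rates are degenerate")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 100 of 1000 users convert on A, 150 of 1000 on B.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With thousands of tests running at once, some will look significant by chance, so a per-test p-value like this would normally be paired with a multiple-comparison correction.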