Big Data, Baby Steps

Big Data presentation for the Utah Big Mountain Conference on 4/12/2014 at Goldman Sachs in downtown Salt Lake City.

Speaker notes

  • Quote: This came from a quick search and is the earliest reference I found; I'm pretty sure someone else said it first, but the idea is the key. Me: an established consumer-facing Web company looking to leverage our data. We started with Hadoop and HBase in 2012 on AncestryDNA. When we started, I looked for guidance and it was missing, so learn from us: what worked, what didn't, and how to adjust. Please understand, I had very little experience with Hadoop, Big Data, etc. before starting the AncestryDNA project, and that project was very specific and focused. Two of my engineers went to lunch with someone outside our company, and that encounter gave us the confidence to use HBase in our DNA project. Best $45 lunch I ever approved. Moving to a general Big Data analytics effort was very different, and yes, I've made mistakes. There were no real blog entries where someone said "this is exactly what we did," so I started writing Big Data posts on ACOM's Technology Blog. Learn from what we're going through.
  • Understand what you are getting into, and understand the Hadoop ecosystem: Hadoop is presented as a "turn-key," "fully baked" technology, and it is definitely not. The architecture diagrams (a general one and a more specific Ancestry one) should spark discussion. Then: how to build a team, the basics of Hadoop distributions and consultants, and details about custom logs at Ancestry and other companies, an area where we have learned a lot. I will end with the top three things to remember; if nothing else, these are the three things I want you to take away.
  • This is an open question. I think the industry is starting up the "slope of enlightenment," but I'm worried we are still dropping into the "trough of disillusionment." It is also possible that different companies are at different points on this curve. The other question: are those of us in this room early adopters, or the early majority?
  • Don't be fooled: what you are doing in any Big Data project is "analytics," anywhere from basic data collection and aggregation to advanced analytics, modeling, and machine learning. Data: there's a story about a tax web site that set out to measure how often users transferred money; what they found was that the average user went through 28 clicks before finding the "Transfer Funds" item. Also consider how quickly you need your insight (batch or near-real time; fraud vs. click stream). Visualize: if I could do anything differently, this is one of them: set the expectations and visualize the outcome up front. Ken Rudin (Zynga, now Facebook) gave a TDWI 2013 keynote in Chicago that is an eye opener; he is someone to follow.
  • Base Hadoop (how close to the bleeding edge do you want to be?), workflow, NoSQL, data organization, log (or any data) collection, near-real-time streaming, and a file system on top of HDFS.
  • Hadoop 2.0 (YARN, Ambari). HBase on top of HDFS. Azkaban: simple, easier to understand and use than Oozie. Stinger with Hive: we are a SQL shop and HQL is close. Kafka: pub/sub at scale; it guarantees message delivery at least once, so you will get duplicate messages and the application must handle that (see the duplicate-handling sketch after these notes). Samza on top of Kafka for near-real-time stream processing (reversed query). Cassandra and MongoDB. OrangeFS on top of HDFS; a normal file-system view over HDFS is valuable. Log collection: we store JSON logs, so we can get away with Elasticsearch (Lucene and Solr) and Kibana (time-series dashboards); Logstash helps if you need to reformat log data, but since we log JSON we don't need it. Tableau has been a huge win for us and spread like wildfire through the organization; more on the next slide.
  • If you look, there are 12+ different Hadoop distributions; I'm going to talk about the main five. Just setting up a Hadoop development environment (a VM with Hadoop to develop against, e.g. on VMware) takes time; even with directions on a wiki, a new developer will take about a week to get it right (install, check, blow away, reinstall). Where are companies moving? Hortonworks is gaining momentum. One of the big three distributions approached us early on and really pitched their services, training, and licensing (about $5K per node). I was in a meeting with our CFO and he said to me, "Hey Bill, I hear that XYZ is saying they are about to close a multi-million-dollar contract with Ancestry for their Hadoop distribution." Well, they didn't close the deal. That's the environment you are in.
  • When I first started, I talked to a consultant who drew a diagram very similar to this on a napkin; I still have it. First, collect your logs and data with Kafka. Stream processing gives you near-real time (we're getting close to this). Raw data is moved to Hadoop; always store the raw data first, then process it (see the landing sketch after these notes). MapReduce over the data and create Hive tables, then run other scripts or MapReduce jobs to create files that are loaded into the EDW. What and where is your data warehouse? It could be Hadoop (Facebook) or a more traditional EDW (like us); we moved from a ten-year-old Microsoft data warehouse on SQL Server 2008 R2 to an MPP solution on ParAccel (now Matrix). Finally, you need a way to expose the data you are collecting back out to the web site and applications so you can take action.
  • Color means it is in place; white means it is coming. At the top we're showing the logging aspect that is included in our web applications and services. You've seen this before: we feed Kafka, which feeds Hadoop (initially we have Elasticsearch and Kibana on five nodes; this will change over time), then we feed aggregate data into the EDW and visualize with Tableau (against Hadoop or the DW). Why have a production Hadoop cluster? Think of LinkedIn: "people we think you know" and "jobs you might be interested in" are generated every night for all their users.
  • Has anyone here tried to hire Hadoop engineers? There are very few, they are in high demand, they usually love their current job, and they are very expensive. We went a different way: we identified smart developers in the company and trained them. This takes more time and is an investment. Recently a Boston Hadoop engineer connected with me through LinkedIn; he is looking to move to SF and found Ancestry because he read my blog entries.
  • This is a dangerous area; you can invest a lot of time and money here. If you have a new team without much Hadoop experience, consultants can be a big help (this was the boat we were in). Prefer vendor-agnostic providers, and find consultants with experience in your area. We really like both of these companies. (Should we show them?)
  • How do the big players (Google, Facebook, and others) handle software development? They have two distinct groups: infrastructure engineers and application engineers. Infrastructure engineers build the common cross-cutting concerns that every team uses, and logging is one of those cross-cutting concerns (along with SLAs, monitoring, automated deployment, virtualization, and A/B testing). We're not exactly the same, but Ancestry is approaching logging the same way. The other item included on every entry is what we call our Big Data headers, so we don't rely on date/time to stitch requests together (see the header sketch after these notes).
  • This is an example of how we might stitch data together. When the user id is not present, we use the permanent anonymous-id cookie; once a user has logged in, we have their account id. In this example, the same user has visited in two different sessions, clearing their cookies in between; once they log in, we can tie the sessions together (see the stitching sketch after these notes).
  • On each server, we allocate a specific amount of disk space for the log files. Log files roll over once they hit a specific size, and we keep 10 active files before deleting old data; the goal is to hold about one day's worth of logs on each host, and if we haven't picked up the logs by then, we lose data. We install a Kafka log scraper on each host that pulls the data written to these files. Kafka uses a very efficient transport mechanism for its producers: it batches multiple messages and compresses them, so it is very light on the network. True confession: we designed this initially without validating how the data collection was going. Big miss. One simple fix is to include an auto-incrementing value in every log message and look for breaks in the sequence (see the sequence-check sketch after these notes). Another is to track messages sent in the logging aspect and messages received in Hadoop: send the message counts on a stream every minute or so, then count the messages received in Hadoop.
  • We still have a lot to learn. The real challenge is rolling this out across a mature web site that produces a lot of data; it is easier to start when you're small and grow or change your technology as needed. Ultimately, we are looking to change our company culture: let the data help us make decisions and get away from the "gut" reactions that drive product decisions. Put a feature out, make sure it moves a metric we care about (Google and Facebook say 80% of all features don't move a key metric), or take that feature down. A/B testing is key. Add in the fact that we are moving from a Microsoft shop to a Java/Linux/open-source development company, and this is a huge shift.
  • There is no right way or wrong way; it has to be "your way." You are doing analytics, not "Big Data," and that analytics must impact the business (or why do it). Other amazing companies have made this transition; find one that matches you and follow them. I love the Netflix architecture videos on YouTube (Adrian Cockcroft, older dude in a t-shirt, usually talking about the Chaos Monkey). LinkedIn has been very open with us, and we've joined the Kafka and Samza projects. The popularity of HBase speaks for itself; Ancestry sponsored an HBase meetup in March that was very successful and very interesting to see. Join the communities and contribute. Join InfoQ and attend QCon in SF (November 2014), a non-Microsoft, open-source conference with amazing presentations, or the Hadoop Summit in San Jose (June 3rd through 5th).
  • I am very impressed with technology companies in Silicon Valley. They share infrastructure code and don't feel it is part of their IP (their IP is their data and algorithms). Ancestry started attending the Yahoo! Big Data meetup and joined the HBase community; this has really opened us up, stimulated innovation, and provided direction for our teams. Companies in Utah can learn a lot from them; I believe we don't share and collaborate nearly as much as our SF counterparts. Currently reading: "Secrets of Analytical Leaders: Insights from Information Insiders" by Wayne Eckerson. I'm willing to discuss Big Data projects and infrastructure with any company; the best way forward is to support each other.
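
A minimal sketch of the duplicate handling mentioned in the Kafka notes above: with at-least-once delivery the consumer side will occasionally see the same message twice, so the application keeps a bounded window of recently seen message ids and drops replays. The class name, the window size, and the choice of a per-host sequence number as the id are illustrative assumptions, not Ancestry's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Consumer-side duplicate suppression for at-least-once delivery.
 * The message id is assumed to be host + per-host sequence number.
 */
public class DuplicateFilter {
    private static final int WINDOW = 100_000; // how many recent ids to remember

    // LRU set of recently seen message ids.
    private final Map<String, Boolean> seen =
        new LinkedHashMap<String, Boolean>(WINDOW, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > WINDOW;
            }
        };

    /** Returns true the first time an id is seen, false for replays. */
    public synchronized boolean firstTime(String messageId) {
        return seen.put(messageId, Boolean.TRUE) == null;
    }
}
```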
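
The architecture notes above say to always store the raw data first and process it afterwards. Here is a sketch of that landing step, assuming the Hadoop Java client (org.apache.hadoop.fs) and a hypothetical date-partitioned path layout that mirrors the year/month/day Hive partitioning; a configured cluster and the Hadoop client jars are required for this to run.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Land raw JSON in HDFS before any MapReduce or Hive step touches it. */
public class RawLanding {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Illustrative path layout matching the day/month/year partitioning.
        Path target = new Path("/data/raw/clickstream/year=2014/month=04/day=12/part-0001.json");
        try (OutputStream out = fs.create(target)) {
            out.write("{\"event\":\"search\",\"hits\":42}\n".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```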
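
As the notes and slide 15 describe, every log entry carries the Big Data headers (user id, anonymous id, session id, request id, client) plus a JSON payload so requests can be stitched without relying on date/time. Below is a self-contained sketch of such an entry; the class name, field names, and JSON layout are illustrative, and in the real system this assembly would live inside the framework's logging aspect rather than in application code.

```java
/** One log line carrying the Big Data headers plus a JSON payload. */
public class BigDataLogEntry {
    private final String userId;        // empty until the visitor signs in
    private final String anonymousId;   // permanent anonymous-id cookie
    private final String sessionId;
    private final String requestId;
    private final String client;        // .NET stack, Java stack, Node.js stack, ...

    public BigDataLogEntry(String userId, String anonymousId, String sessionId,
                           String requestId, String client) {
        this.userId = userId;
        this.anonymousId = anonymousId;
        this.sessionId = sessionId;
        this.requestId = requestId;
        this.client = client;
    }

    /** Renders the headers plus an event payload as one JSON log line. */
    public String toJson(String event, String payloadJson) {
        return String.format(
            "{\"userId\":\"%s\",\"anonymousId\":\"%s\",\"sessionId\":\"%s\","
          + "\"requestId\":\"%s\",\"client\":\"%s\",\"event\":\"%s\",\"data\":%s}",
            userId, anonymousId, sessionId, requestId, client, event, payloadJson);
    }

    public static void main(String[] args) {
        // An anonymous visitor: userId stays empty until sign-in.
        BigDataLogEntry entry = new BigDataLogEntry(
            "", "anon-1234", "sess-01", "req-0001", "java-stack");
        System.out.println(entry.toJson("search", "{\"hits\":42}"));
    }
}
```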
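
A small sketch of the stitching rule from the notes and slide 16: group events by account id when it is known, otherwise by the permanent anonymous id, and once a sign-in event carries both ids, tie the earlier anonymous sessions to that account. The in-memory map is purely illustrative; in practice this would be a batch join over the Hive tables.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative identity stitching: account id wins; anonymous id otherwise. */
public class IdentityStitcher {
    // anonymous id -> account id, learned from sign-in events
    private final Map<String, String> anonToAccount = new HashMap<>();

    /** Call when a log entry contains both ids (the user just signed in). */
    public void recordSignIn(String anonymousId, String accountId) {
        anonToAccount.put(anonymousId, accountId);
    }

    /** Key used to group a log entry when stitching sessions together. */
    public String stitchKey(String anonymousId, String accountId) {
        if (accountId != null && !accountId.isEmpty()) {
            return accountId;                              // signed-in traffic
        }
        // Anonymous traffic: use the account id if a later sign-in linked it.
        return anonToAccount.getOrDefault(anonymousId, anonymousId);
    }
}
```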
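
The log-collection notes above suggest validating delivery by putting an auto-incrementing value in every message and looking for breaks in the sequence on the Hadoop side. A sketch of that idea follows; it assumes per-host ordering is preserved end to end (which Kafka provides within a partition), and all names are illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sender stamps a per-host sequence number; receiver reports gaps. */
public class SequenceCheck {

    /** Sender side: stamp each message with host + next sequence number. */
    public static class Stamper {
        private final String host;
        private final AtomicLong next = new AtomicLong();

        public Stamper(String host) { this.host = host; }

        public String stamp(String jsonBody) {
            long seq = next.incrementAndGet();
            return String.format("{\"host\":\"%s\",\"seq\":%d,\"body\":%s}", host, seq, jsonBody);
        }
    }

    /** Receiver side (per host): report a gap when a sequence number is skipped. */
    public static class GapDetector {
        private long lastSeen = 0;

        public void observe(long seq) {
            if (seq > lastSeen + 1) {
                System.err.printf("lost %d message(s) between %d and %d%n",
                                  seq - lastSeen - 1, lastSeen, seq);
            }
            lastSeen = Math.max(lastSeen, seq);
        }
    }
}
```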

Transcript

  • 1. Big Data, Baby Steps: "What Every Leader Should Consider When Starting a Big Data Initiative," April 12, 2014
  • 2. Goal for this presentation: "Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it..." - Dan Ariely on Facebook, Jan 6, 2013, and "others". Why Me? Why Ancestry? • Established consumer-facing Web company looking to leverage our data • Started with Hadoop and HBase in 2012 on AncestryDNA • When we started, I looked for guidance – it was missing • Learn from us: what works, what didn't, how to adjust
  • 3. Agenda • What to consider before you start • Understand the Hadoop ecosystem – What pieces is Ancestry using and why? – Big Data architecture at Ancestry • Hadoop distributions • Big Data consultants • How to build your team(s) • Custom logs – Other companies and Ancestry specifics • Top three things to remember 3
  • 4. Gartner new technology hype cycles: Where is Big Data (and Big Data Analytics) on this curve? Source: Gartner, August 2013
  • 5. What to consider before you start • Big Data, Business Intelligence, and Analytics are tied – Analytics is an umbrella term that represents the entire ecosystem needed to turn data into actions • Understand your “data” – Web click stream data, sales transactions, advertising data, fraud detection, sensor data, social data, etc. • Visualize your final goal and work backwards – Imagine (prototype) the dashboards, analytics, and actions that will be available • Deliver value to the business at each step – “Goal of analytics is not to produce actionable insights; the goal is to produce results.” Ken Rudin 5
  • 6. Understand the Hadoop ecosystem • Hadoop 2.0 and HDFS (Yarn) • Workflow • NoSQL • Data Organization • Log collection • Near Real-Time Stream Processing • NFS File System on HDFS 6
  • 7. What are the pieces Ancestry is using? We use or plan to use: Yarn and Ambari Forensics on log data: Visualization: (Graphs + Deep Zoom) 7
  • 8. Visualization Company that used traditional “Cubes” and Excel – Business Intelligence/Data Warehouse world has moved beyond cubes – Great product that didn’t work for us – People went back to using Excel – In two weeks, 30 people created 120+ dashboards and reports – Tied to an MPP Data Warehouse is changing our company – Created the “Wild, wild, west” - fixing with a blessed portal 8
  • 9. Hadoop distributions • Open Source, Active Community, Large Eco-System of Projects, requires more internal knowledge and support • First Distribution, Large “War Chest” (Cash Investment), Impala, and the Cloudera Console • Custom file system (API equivalent to HDFS) that improves performance, custom Hbase implementation, High Availability Features • Closest to Apache Hadoop, tested on Yahoo!’s 7000 node cluster before being released. • Several Cloud options: Google and Amazon. Quick and easy to get going. Great way to experiment and learn. Watch your data storage costs 9
  • 10. Typical Big Data architecture (diagram). Components: user-facing stacks and services with a log forwarder and Kafka producer; Kafka; Samza stream processing (runs on Hadoop) over streams A, B, and C with a stream repo; the Hadoop system of record holding raw data with simple ETL and MapReduce ETL; the EDW (MPP) fed by simple ELT; a Cassandra repo of user properties and user segments, with rules defined by marketing segmentation and targeting management; global properties and models; and actions and feeds that expose the data back to the web site.
  • 11. Ancestry system diagram (diagram). Components: .NET, Java, JVM, Vert.x, Node.js, and Python stacks with logging aspects and Kafka producers; a log forwarder into Kafka (with a mirror); Samza stream processing over streams A, B, and C plus a notification service; the Hadoop system of record with Dogwood ELT and MapReduce ETL feeding the EDW (ParAccel); Elasticsearch and Kibana (Splunk-alternative initiative for operational monitoring and reporting); a production Hadoop cluster; Tableau; the User 360 services initiative; and Kafka actions and feeds.
  • 12. How to organize and build your team(s) • Hiring vs. training smart developers in your organization – Training ▫ Self-starters who can train themselves ▫ Online training that is free or with minimal cost ▫ Paid training for specific technologies – Promote your technology and people will reach out to you ▫ Bit of a chicken and egg problem • Key roles for the team – Developers who understand operations – Hadoop engineers – Team leaders and managers 12
  • 13. Big Data consultants • Lots of them, charging lots of money • Not all of them are created equal • Prefer consultants who are vendor agnostic • Find consultants who have experience in what you want to do • Check references 13
  • 14. Companies working with custom logs 14 • Scribe, Scuba, Hive, and Hadoop as the data warehouse infrastructure. Run over 10K Hive scripts daily to crunch log data. Analyst on each team to make sure logging is correct. • Uses a very simple interface similar to log4j to log data. How to keep this accurate? • Tried Scribe. Implemented Kafka and Avro to collect log data. Use a binary format with a schema registry. • Recently open sourced their log collecting infrastructure (Suro – Data Pipeline). “Used to be a web site that occasionally logged data. Now we’re a logging engine that occasionally serves as a web site.”
  • 15. Collecting custom logs at Ancestry • Framework piece with a “Logging Aspect” – Logging is a cross cutting concern – Avoid breaking changes – Annotations for parameter names (normalization layer) • Defined Big Data headers that must be present in every log (User ID, Anonymous ID, Session, Request ID, Client) – Stitch data together – Partitioned in Hive by day/month/year – JSON payload – Validate messages sent vs. messages received – Schema repository (long-term) 15
  • 16. Stitching data together 16
  • 17. Ancestry log collection details. Each server: • 10 rolling logs • Scraper process. Validate your data collection infrastructure: • Auto-incrementing count in every log message • Count on the framework side (sender) and count in Hadoop (receiver). (Diagram: a single server's local hard drive holds 10 rolling files; a Kafka scraper and log sender forward them to the log receiver in Hadoop.)
  • 18. Ancestry moving forward • Ancestry is not “done” - the journey continues – Still evolving and changing – My thinking and understanding has also changed • Means we will embrace new technologies in the future – Keep our eyes open and experiment • This is affecting the entire organization – Becoming more involved with Open Source and the communities that support it 18
  • 19. Top three things to remember • First and foremost, understand your needs – No clear right or wrong way – Keep it simple because simple scales • This is about Analytics and impacting the business • Find a company that fits you and follow them: – Netflix (cloud architecture, code for survival, simian army) – Facebook (HBase) – LinkedIn (Kafka, Samza, Azkaban) 19
  • 20. Bill's contact information: Bill Yetman, VP of Engineering at Ancestry. byetman@ancestry.com, http://blogs.ancestry.com/techroots/ (filter on Big Data or search for "Adventures in Big Data" in the title)