
Setting up a mini big data architecture, just for you! - Bas Geerdink



In this session, we'll start from scratch and build a small software stack that you can use to experiment with big data software. Along the way, I'll show the steps for setting up a virtual server with a NoSQL database, Hadoop, a stream processing engine, and visualization tools. After importing the data, we'll have a modest result in the form of a visualization of some 'little' big data. This session gives you an introduction to the world of big data architecture without getting too complex or fuzzy. There will be some theory, but the focus is on the practical things you need to do to get started. Bring your laptop if you want some hands-on experience right away! Join this session if you want to understand what's under the hood of Cloudera, Hortonworks, and MapR, and want to play with modern open source software!

Published in: Technology


  1. Building a (mini) Big Data architecture – Bas Geerdink, 5 November 2014
  2. About me • Work: ING • Education: Master’s degree in AI and Informatics • Programming since 1998 (C#, Java, Scala, Python, …) • Twitter: @bgeerdink • Email:
  3. Introduction • Big Data – Volume, Velocity, Variety • Predictive Analytics / Machine Learning – Classification – Clustering – Recommendation • Today’s goal: – Start small, create a playground! – Learn some basic tools and techniques
  4. Reference big data solution architecture
  5. There are several out-of-the-box options to get started with big data development • On-premise: – Hortonworks – Cloudera – MapR – IBM InfoSphere BigInsights – HP Vertica – Oracle – Teradata – SAS • Cloud-based: – Amazon Elastic MapReduce – Microsoft Azure HDInsight – Google (App Engine, BigTable, Prediction API, …) – SAP HANA … however, we’ll set up our own environment!
  6. Mahout features • Optimized for large datasets (millions of records) • Moving from Hadoop to Spark • Supervised learning – Classification: Naïve Bayes, Hidden Markov Models, Random Forest – Regression: Logistic Regression (predicts a probability of class membership) • Unsupervised learning – Clustering: k-Means, Canopy – Recommendations
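The clustering entry in that list can be made concrete with a tiny k-means loop in plain Python. This is a toy sketch, not Mahout's API: Mahout runs the same assign-and-recompute idea as parallel jobs over millions of records, while here the six 2-D points and the deterministic seeding are invented purely for illustration.

```python
def kmeans(points, k, iterations=10):
    """Tiny k-means on 2-D points: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    # Deterministic seeding for the demo: evenly spaced input points
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious blobs: three points near (0, 0), three near (10, 10)
data = [(0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(data, k=2)
```

With well-separated blobs like these, the loop converges in one pass to the two blob means; real data needs multiple random restarts, which is exactly the bookkeeping Mahout handles at scale.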
  7. Mahout Algorithms

     | Size of dataset | Mahout algorithm            | Execution model | Characteristics                                      |
     |-----------------|-----------------------------|-----------------|------------------------------------------------------|
     | Small           | SGD                         | Sequential      | Uses all types of predictor vars                     |
     | Medium          | (Complementary) Naïve Bayes | Parallel        | Prefers text, high training cost                     |
     | Large           | Random Forest               | Parallel        | Uses all types of predictor vars, high training cost |

     Source: Cloudera (2011)
  8. Example 1: newsgroups • Data: newsgroup items • 20,000 records • Train with Naïve Bayes Classifier • Categories: 20 newsgroups • Prediction: newsgroup of unclassified item
  9. Example 2: hospital treatment • Data: hospital surgeries in the 1950s, '60s, and '70s • 306 records • Train with logistic regression • Features: – Age of subject – Year of treatment – Number of positive axillary nodes • Prediction: survival rate • Visualization: D3.js
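The logistic regression behind this example can be sketched with plain batch gradient descent. The three features match the slide (age, year of treatment, positive axillary nodes), but the six feature vectors below are invented and pre-scaled to [0, 1] for the demo; they are not the 306 real records, and this is not the Mahout/R pipeline from the talk.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=3000):
    """Batch gradient descent on the logistic loss.
    X: list of feature vectors, y: list of 0/1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample: p - y
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Average the gradients and step downhill
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict_proba(x, w, b):
    """Predicted probability of survival (class 1)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Invented, scaled stand-ins for (age, year of treatment, positive nodes)
X = [[0.3, 0.5, 0.0], [0.4, 0.6, 0.1], [0.5, 0.4, 0.0],   # survived (1)
     [0.7, 0.5, 0.9], [0.8, 0.3, 0.8], [0.6, 0.4, 1.0]]   # did not (0)
y = [1, 1, 1, 0, 0, 0]
w, b = train_logreg(X, y)
```

The learned weights end up negative on the node-count feature, which matches the intuition that more positive axillary nodes lowers the predicted survival probability; the resulting probabilities are what the D3.js step visualizes.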
  10. Summary
  11. Want to move on? • Follow courses on Coursera – Machine Learning – Introduction to Data Science • Read Hadoop/Mahout/R tutorials and books • Get some ML datasets • Expand the ecosystem: Hive, Pig, HBase, Spark, …