SlideShare a Scribd company logo
1 of 35
Download to read offline
Big Data for Everyone Twitter:  #bd4e
Introduction to Big Data Steve Watt Hadoop Strategy  @wattsteve  #bd4e
What is “Big Data”? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
How were these issues addressed? ,[object Object],[object Object],[object Object],[object Object]
What is Apache Hadoop ? It is a cluster technology with a single master and multiple slaves, designed for  commodity hardware It consists of two runtimes, the Hadoop distributed file system ( HDFS ) and  Map/Reduce   As data is copied onto the HDFS, it ensures the data is blocked and replicated to other machines (node) to provide  redundancy Self contained  jobs  are written in Map/Reduce and submitted to the cluster. The jobs run in parallel on each of the machines in the cluster, processing the data on the local machine ( data locality ). Hadoop may execute or re-execute a job on any node in the cluster.  Node  failures are automatically handled  by the framework.
The Big Data Ecosystem ClusterChef / Apache Whirr / EC2 Hadoop Pig / WuKong /Cascading Cassandra / HBase Offline Systems (Analytics) Human Consumption BigSheets / DataMeer Hive / Karmasphere Provisioning Nutch / SQOOP / Flume Scripting DBA Non-Programmer Import/Export Tooling Visualizations Online Systems (OLTP @ Scale) NoSQL Commodity Hardware
Offline customer scenario Eric Sammer  Solution Architect  @esammer  #bd4e
Use Case: Product Recommendations ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Problems ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The Answer ,[object Object],[object Object]
 
Online customer scenario Matt Pfeil CEO @mattz62  #bd4e
What is Apache Cassandra? 03/17/11
Use Case: Managing Email ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Requirements ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Solution ,[object Object],[object Object],[object Object],[object Object],[object Object]
Where to next? The Adjacent Possible Flip Kromer CTO @mrflip  #bd4e
Something about Anything
Everything about Something
Bigger than One Computer
Bigger than Frontal Lobe
Bigger than Excel
what’s coming to help
myth of the “data base” ,[object Object],[object Object]
Managing & Shipping ,[object Object],[object Object],[object Object],[object Object],[object Object]
Data flutters by label Elephants make sturdy piles  {GROUP} Number becomes thought process_group Hadoop
Twitter Parser in a Tweet class TwStP < Streamer  def process line  a = JSON.load(line) rescue {}  yield a.values_at(*a.keys.sort)  endendWukong.run(TwStP)
pure functionality
pure fun ctionality
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Data Stores in Production
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Dev Ops: Rethink Hard
Still Blind ,[object Object],[object Object],[object Object],[object Object]
Human-Scale Tools ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Panel Discussion Stu Hood Software Engineer @stuhood  #bd4e
Thanks for coming! Stu Hood @stuhood  Flip Kromer @mrflip Matt Pfeil @mattz62 Eric Sammer @esammer Steve Watt @wattsteve

More Related Content

What's hot

Big data on Azure for Architects
Big data on Azure for ArchitectsBig data on Azure for Architects
Big data on Azure for ArchitectsTomasz Kopacz
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Big Data Spain
 
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Big Data Taiwan 2014 Track2-2: Informatica Big Data SolutionBig Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Big Data Taiwan 2014 Track2-2: Informatica Big Data SolutionEtu Solution
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azureDavid Giard
 
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup   Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup Qubole
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analyticsjoshwills
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟datastack
 
Dataminds - ML in Production
Dataminds - ML in ProductionDataminds - ML in Production
Dataminds - ML in ProductionNathan Bijnens
 
Data & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architectureData & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architectureNiels Naglé
 
Data infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInData infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInHari Shankar Sreekumar
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannDatabricks
 
Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...
Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...
Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...Yahoo Developer Network
 
Ch-ch-ch-ch-changes....Stitch Triggers - Andrew Morgan
Ch-ch-ch-ch-changes....Stitch Triggers - Andrew MorganCh-ch-ch-ch-changes....Stitch Triggers - Andrew Morgan
Ch-ch-ch-ch-changes....Stitch Triggers - Andrew MorganMongoDB
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConfQubole
 

What's hot (20)

Big Data Usecases
Big Data UsecasesBig Data Usecases
Big Data Usecases
 
Big Data with Azure
Big Data with AzureBig Data with Azure
Big Data with Azure
 
Big data on Azure for Architects
Big data on Azure for ArchitectsBig data on Azure for Architects
Big data on Azure for Architects
 
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
Fishing Graphs in a Hadoop Data Lake by Jörg Schad and Max Neunhoeffer at Big...
 
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Big Data Taiwan 2014 Track2-2: Informatica Big Data SolutionBig Data Taiwan 2014 Track2-2: Informatica Big Data Solution
Big Data Taiwan 2014 Track2-2: Informatica Big Data Solution
 
Big Data on azure
Big Data on azureBig Data on azure
Big Data on azure
 
Introduction to Azure HDInsight
Introduction to Azure HDInsightIntroduction to Azure HDInsight
Introduction to Azure HDInsight
 
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup   Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analytics
 
عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟عصر کلان داده، چرا و چگونه؟
عصر کلان داده، چرا و چگونه؟
 
1. what is hadoop part 1
1. what is hadoop   part 11. what is hadoop   part 1
1. what is hadoop part 1
 
Dataminds - ML in Production
Dataminds - ML in ProductionDataminds - ML in Production
Dataminds - ML in Production
 
Ajug april 2011
Ajug april 2011Ajug april 2011
Ajug april 2011
 
Data & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architectureData & analytics challenges in a microservice architecture
Data & analytics challenges in a microservice architecture
 
Data infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedInData infrastructure and Hadoop at LinkedIn
Data infrastructure and Hadoop at LinkedIn
 
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha DittmannAzure Databricks—Apache Spark as a Service with Sascha Dittmann
Azure Databricks—Apache Spark as a Service with Sascha Dittmann
 
Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...
Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...
Apache Hadoop India Summit 2011 talk "Data Infrastructure on Hadoop" by Venka...
 
Ch-ch-ch-ch-changes....Stitch Triggers - Andrew Morgan
Ch-ch-ch-ch-changes....Stitch Triggers - Andrew MorganCh-ch-ch-ch-changes....Stitch Triggers - Andrew Morgan
Ch-ch-ch-ch-changes....Stitch Triggers - Andrew Morgan
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 

Viewers also liked

Titan Newsletter Summer
Titan Newsletter SummerTitan Newsletter Summer
Titan Newsletter Summertitanfiresec
 
LPR Stands Brazil
LPR Stands BrazilLPR Stands Brazil
LPR Stands BrazilLPR Ltda
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataSteve Watt
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopSteve Watt
 
Les icônes : Qui sont-elles ? Quelles sont leurs pictos ?
Les icônes : Qui sont-elles ? Quelles sont leurs pictos ? Les icônes : Qui sont-elles ? Quelles sont leurs pictos ?
Les icônes : Qui sont-elles ? Quelles sont leurs pictos ? Sébastien Desbenoit
 
Enabled by Design at SXSW 2013 (without audio)
Enabled by Design at SXSW 2013 (without audio)Enabled by Design at SXSW 2013 (without audio)
Enabled by Design at SXSW 2013 (without audio)FutureGov
 
Creation d'une police de caractère
Creation d'une police de caractèreCreation d'une police de caractère
Creation d'une police de caractèreSerge Paulus
 
Kiwipycon2011 async-with-gevent-redis
Kiwipycon2011 async-with-gevent-redisKiwipycon2011 async-with-gevent-redis
Kiwipycon2011 async-with-gevent-redisalexdong
 
Real-Time Python Web: Gevent and Socket.io
Real-Time Python Web: Gevent and Socket.ioReal-Time Python Web: Gevent and Socket.io
Real-Time Python Web: Gevent and Socket.ioRick Copeland
 
Teaching with Sakai CLE from the Ground Up!
Teaching with Sakai CLE from the Ground Up!Teaching with Sakai CLE from the Ground Up!
Teaching with Sakai CLE from the Ground Up!LandonPhillips
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRick Copeland
 
Monkigras 2012: Networks Of Data
Monkigras 2012: Networks Of DataMonkigras 2012: Networks Of Data
Monkigras 2012: Networks Of DataMatt Biddulph
 
Nottingham hack soc
Nottingham hack socNottingham hack soc
Nottingham hack socSK53
 
Keynote | The Rise and Fall and Rise of Java | James Governor
Keynote | The Rise and Fall and Rise of Java | James GovernorKeynote | The Rise and Fall and Rise of Java | James Governor
Keynote | The Rise and Fall and Rise of Java | James GovernorJAX London
 
Java in the database–is it really useful? Solving impossible Big Data challenges
Java in the database–is it really useful? Solving impossible Big Data challengesJava in the database–is it really useful? Solving impossible Big Data challenges
Java in the database–is it really useful? Solving impossible Big Data challengesRogue Wave Software
 
Juicer - A fast template engine using javascript
Juicer - A fast template engine using javascriptJuicer - A fast template engine using javascript
Juicer - A fast template engine using javascriptpaulguo
 

Viewers also liked (20)

Titan Newsletter Summer
Titan Newsletter SummerTitan Newsletter Summer
Titan Newsletter Summer
 
Projects
ProjectsProjects
Projects
 
LPR Stands Brazil
LPR Stands BrazilLPR Stands Brazil
LPR Stands Brazil
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big Data
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Les icônes : Qui sont-elles ? Quelles sont leurs pictos ?
Les icônes : Qui sont-elles ? Quelles sont leurs pictos ? Les icônes : Qui sont-elles ? Quelles sont leurs pictos ?
Les icônes : Qui sont-elles ? Quelles sont leurs pictos ?
 
Enabled by Design at SXSW 2013 (without audio)
Enabled by Design at SXSW 2013 (without audio)Enabled by Design at SXSW 2013 (without audio)
Enabled by Design at SXSW 2013 (without audio)
 
Creation d'une police de caractère
Creation d'une police de caractèreCreation d'une police de caractère
Creation d'une police de caractère
 
Kiwipycon2011 async-with-gevent-redis
Kiwipycon2011 async-with-gevent-redisKiwipycon2011 async-with-gevent-redis
Kiwipycon2011 async-with-gevent-redis
 
Real-Time Python Web: Gevent and Socket.io
Real-Time Python Web: Gevent and Socket.ioReal-Time Python Web: Gevent and Socket.io
Real-Time Python Web: Gevent and Socket.io
 
Teaching with Sakai CLE from the Ground Up!
Teaching with Sakai CLE from the Ground Up!Teaching with Sakai CLE from the Ground Up!
Teaching with Sakai CLE from the Ground Up!
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
 
Digpen search engine optimisation
Digpen search engine optimisationDigpen search engine optimisation
Digpen search engine optimisation
 
Monkigras 2012: Networks Of Data
Monkigras 2012: Networks Of DataMonkigras 2012: Networks Of Data
Monkigras 2012: Networks Of Data
 
Nottingham hack soc
Nottingham hack socNottingham hack soc
Nottingham hack soc
 
Keynote | The Rise and Fall and Rise of Java | James Governor
Keynote | The Rise and Fall and Rise of Java | James GovernorKeynote | The Rise and Fall and Rise of Java | James Governor
Keynote | The Rise and Fall and Rise of Java | James Governor
 
Java in the database–is it really useful? Solving impossible Big Data challenges
Java in the database–is it really useful? Solving impossible Big Data challengesJava in the database–is it really useful? Solving impossible Big Data challenges
Java in the database–is it really useful? Solving impossible Big Data challenges
 
Company Profile
Company ProfileCompany Profile
Company Profile
 
Juicer - A fast template engine using javascript
Juicer - A fast template engine using javascriptJuicer - A fast template engine using javascript
Juicer - A fast template engine using javascript
 
Stackato v5
Stackato v5Stackato v5
Stackato v5
 

Similar to Final deck

Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformGeekNightHyderabad
 
Hadoop Online training by Keylabs
Hadoop Online training by KeylabsHadoop Online training by Keylabs
Hadoop Online training by KeylabsSiva Sankar
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptalmaraniabwmalk
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Imam Raza
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopJosh Patterson
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampSpotle.ai
 
AWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAmazon Web Services
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overviewNitesh Ghosh
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introductionsaisreealekhya
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopHazelcast
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop User Group
 

Similar to Final deck (20)

Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data Platform
 
Hadoop Online training by Keylabs
Hadoop Online training by KeylabsHadoop Online training by Keylabs
Hadoop Online training by Keylabs
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hadoop
HadoopHadoop
Hadoop
 
Lecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.pptLecture 5 - Big Data and Hadoop Intro.ppt
Lecture 5 - Big Data and Hadoop Intro.ppt
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
AWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloud
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Big data and Hadoop overview
Big data and Hadoop overviewBig data and Hadoop overview
Big data and Hadoop overview
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 

More from Steve Watt

Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerSteve Watt
 
Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerSteve Watt
 
Hadoop for the disillusioned
Hadoop for the disillusionedHadoop for the disillusioned
Hadoop for the disillusionedSteve Watt
 
Hadoop file systems
Hadoop file systemsHadoop file systems
Hadoop file systemsSteve Watt
 
Apache con 2013-hadoop
Apache con 2013-hadoopApache con 2013-hadoop
Apache con 2013-hadoopSteve Watt
 
Apache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructureApache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructureSteve Watt
 
Mining the Web for Information using Hadoop
Mining the Web for Information using HadoopMining the Web for Information using Hadoop
Mining the Web for Information using HadoopSteve Watt
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaSteve Watt
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchSteve Watt
 

More from Steve Watt (10)

Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and Docker
 
Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and Docker
 
Hadoop for the disillusioned
Hadoop for the disillusionedHadoop for the disillusioned
Hadoop for the disillusioned
 
Hadoop file systems
Hadoop file systemsHadoop file systems
Hadoop file systems
 
Apache con 2013-hadoop
Apache con 2013-hadoopApache con 2013-hadoop
Apache con 2013-hadoop
 
Apache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructureApache con 2012 taking the guesswork out of your hadoop infrastructure
Apache con 2012 taking the guesswork out of your hadoop infrastructure
 
Mining the Web for Information using Hadoop
Mining the Web for Information using HadoopMining the Web for Information using Hadoop
Mining the Web for Information using Hadoop
 
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and VerticaBridging Structured and Unstructred Data with Apache Hadoop and Vertica
Bridging Structured and Unstructred Data with Apache Hadoop and Vertica
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
Extractiv
ExtractivExtractiv
Extractiv
 

Recently uploaded

Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfROWELL MARQUINA
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 

Recently uploaded (20)

Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.How Tech Giants Cut Corners to Harvest Data for A.I.
How Tech Giants Cut Corners to Harvest Data for A.I.
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 

Final deck

  • 1. Big Data for Everyone Twitter: #bd4e
  • 2. Introduction to Big Data Steve Watt Hadoop Strategy @wattsteve #bd4e
  • 3.
  • 4.
  • 5. What is Apache Hadoop ? It is a cluster technology with a single master and multiple slaves, designed for commodity hardware It consists of two runtimes, the Hadoop distributed file system ( HDFS ) and Map/Reduce As data is copied onto the HDFS, it ensures the data is blocked and replicated to other machines (node) to provide redundancy Self contained jobs are written in Map/Reduce and submitted to the cluster. The jobs run in parallel on each of the machines in the cluster, processing the data on the local machine ( data locality ). Hadoop may execute or re-execute a job on any node in the cluster. Node failures are automatically handled by the framework.
  • 6. The Big Data Ecosystem ClusterChef / Apache Whirr / EC2 Hadoop Pig / WuKong /Cascading Cassandra / HBase Offline Systems (Analytics) Human Consumption BigSheets / DataMeer Hive / Karmasphere Provisioning Nutch / SQOOP / Flume Scripting DBA Non-Programmer Import/Export Tooling Visualizations Online Systems (OLTP @ Scale) NoSQL Commodity Hardware
  • 7. Offline customer scenario Eric Sammer Solution Architect @esammer #bd4e
  • 8.
  • 9.
  • 10.
  • 11.  
  • 12. Online customer scenario Matt Pfeil CEO @mattz62 #bd4e
  • 13. What is Apache Cassandra? 03/17/11
  • 14.
  • 15.
  • 16.
  • 17. Where to next? The Adjacent Possible Flip Kromer CTO @mrflip #bd4e
  • 20. Bigger than One Computer
  • 24.
  • 25.
  • 26. Data flutters by label Elephants make sturdy piles {GROUP} Number becomes thought process_group Hadoop
  • 27. Twitter Parser in a Tweet class TwStP < Streamer def process line a = JSON.load(line) rescue {} yield a.values_at(*a.keys.sort) endendWukong.run(TwStP)
  • 30.
  • 31.
  • 32.
  • 33.
  • 34. Panel Discussion Stu Hood Software Engineer @stuhood #bd4e
  • 35. Thanks for coming! Stu Hood @stuhood Flip Kromer @mrflip Matt Pfeil @mattz62 Eric Sammer @esammer Steve Watt @wattsteve

Editor's Notes

  1. What about cascading?
  2. What did a user buy? – We don’t want to recommend products they’ve already purchased in the past. What products did a user browse / hover / rate / add to cart – These all provide additional information we can use in making meaningful recommendations. Obviously you wouldn’t want to recommend a product a user has rated poorly (which may also indicate they own it but bought it from someone else). We can sometimes derive implicit information like this from explicit actions. Adding an item to a cart but not purchasing it indicates a strong “likely purchase.” User attributes can tell us more about the appropriate products to recommend for a user. Maybe we should recommend products within a user’s expected budget to increase likelihood of purchase. Also, we don’t want to offer deep discounts to users who can (or have historically) paid full price for products. Knowing what our margins on a product are can also impact our decisions. Inventory is also interesting – we don’t want to recommend something that’s not in stock!
  3. If we can read data sequentially at ~250MBps (SATA II, 3Gbps minus some overhead) we’re talking about 1200GB * 1024 (convert to MB) / 250MBps = 4952 seconds / 60 = ~82 minutes just to read the activity logs. This doesn’t account for growth. Distilling data (i.e. grouping users to interested categories and then picking interesting products in each category) drastically reduces accuracy. More data helps. Could we provide better recommendations by looking at year back? What if we looked at people who browsed / bought a significant number of products in common with you and based recommendations on that? What if we could take social data like FB, twitter into account and recommend products based on friends? Repeating this operation to keep data up to date could get expensive quickly.
  4. A batch operation to calculate recommendations daily probably makes sense. For each user, store a list of product IDs and present them pseudo-randomly or when a user views a product in the same category. Hadoop Map Reduce allows you to parallelize this type of operation so it can be run on hundreds or thousands of machines.
  5. Walk through the basic architecture for a recommendation engine running on Hadoop with RDBMS import / export and app servers logging to HDFS. Mention RDBMS could be HBase, Cassandra, or other “big data” data storage. Sqoop can do RDBMS / Hadoop imp / exp Flume and similar systems can facilitate reliable logging to HDFS from various sources.
  6. - From the Big Data leaders – Google + Amazon, Facebook
  7. Scalability = ability to add machines for horizontal/easy growth Redundancy = data replicated High performance = Fast.
  8. Cost = ~$300K
  9. So outside infochimps the internet is this wonderful smorgasbord of facts: With Google and Wikipedia and the rest you can find something about anything you seek.
  10. But we’re ready to move to a world where you can find heartier fare -- to find out everything about something
  11. Program according to this haiku and hadoop takes care of the janitorial work
  12. Here’s a working twitter stream JSON-to-TSV parser that fits in a tweet Run it with cat stream.json | ruby -r&apos;wukong/script&apos; -rjson parser_in_a_tweet.rb --map both fit comfortably in a tweet.
  13. what’s left is pure functionality,
  14. which to a programmer is pure fun.
  15. Now I know this sounds like the lunacy of a ritalin-addled architecture astronaut spending too much time on StackOverflow.
  16. Now I know this sounds like the lunacy of a ritalin-addled architecture astronaut spending too much time on StackOverflow.