
Hive London Meetup: Interactive (functional) programming with Apache Hive and F#


Talk given on 31.10.2013 by Matthew Moloney, the founder of Tsunami (tsunami.io). Matthew previously worked on Big Data tooling at eBay and Microsoft and is particularly interested in Functional Programming and Machine Learning.

Abstract: Many common mistakes in Hive programming are preventable and waste both user time and cluster time. Matt will present an interface that not only prevents these mistakes but also gives you helpful hints while you're typing.

Published in: Technology


  1. Big Data Scripting
     Social @tsunamiide tsunami.io Earthquake Enterprises
  2. Founder of Lift Analytics
     • (F# IDE)
     • Big Data / Machine Learning Applied Researcher
     • Business Intelligence
     • Process Engineering
  3. Two main tasks on the Big Data Pipeline are Getting More Data and Finding More Factors
     [Diagram labels: Get More Data, Find More Factors, Machine Learning, A/B Testing]
  4. Data Science is an inherently exploratory pursuit
     • Over 50% of the queries you write will only be executed once
     • Very few queries are worth saving
     • Exploration involves writing a whole lot of new code and then throwing most of it away
     • There are no QA teams and no test suites
     • Many more new opportunities to make mistakes
  5. Most queries are exploratory in nature
     • The decision on which query to run next often depends on the result of the previous query
     • Queries are often very short (e.g. 5 minutes)
     • Queues are often very long (e.g. 2 hours)
     • Mistakes waste a lot of your time
  6. Clusters are a shared resource
     • A simple mistake may kill a big job after 12 hours of cluster processing time
     • Mistakes waste everyone's time
  7. Democratized write access to the cluster
     • Meta-data is often kept as tribal knowledge, team wikis, and files sent in emails
  8. Users will start to share datasets amongst themselves without any formal agreements or dependency management
     • Other people can now break your code
  9. There are many more chances for making mistakes; they have become easier to make, and they are far more costly.
     • Most of the time they are not even your fault and there is nothing you could have reasonably done to prevent them.
  10. Social @tsunamiide tsunami.io Earthquake Enterprises
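
The slides argue that mistakes should be caught before a query reaches the shared cluster. As a rough illustration only (this is not the interface presented in the talk), the Python sketch below checks a query's table and column names against a local copy of the metadata and suggests a likely correction for typos. The `CATALOG` contents and the `check_columns` helper are hypothetical names made up for this example.

```python
from difflib import get_close_matches

# Hypothetical metadata catalog: table name -> column names.
# Assumption: in practice this would be read from the cluster's metastore
# rather than from a wiki page or an email attachment.
CATALOG = {
    "page_views": ["user_id", "url", "view_time"],
    "users": ["user_id", "country", "signup_date"],
}


def _hint(name, candidates):
    """Return a ' (did you mean ...?)' suffix if a close match exists."""
    match = get_close_matches(name, candidates, n=1)
    return f" (did you mean '{match[0]}'?)" if match else ""


def check_columns(table, columns, catalog=CATALOG):
    """Return a list of problems found; an empty list means the check passed."""
    if table not in catalog:
        return [f"unknown table '{table}'{_hint(table, list(catalog))}"]

    known = catalog[table]
    return [
        f"unknown column '{col}' in '{table}'{_hint(col, known)}"
        for col in columns
        if col not in known
    ]


if __name__ == "__main__":
    # A typo is reported locally, before the job ever waits in a queue.
    for problem in check_columns("page_views", ["user_id", "veiw_time"]):
        print(problem)
```

A check like this costs milliseconds, whereas the same typo discovered on the cluster costs at least one trip through the queue. It also covers the shared-dataset case on slide 8: if someone renames a column, refreshing the catalog makes the breakage visible before the next run.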