The document discusses data science workflows on Hadoop. It describes data science as involving three phases: data plumbing to ingest and transform data, exploratory analytics to investigate and analyze data, and operational analytics to build and deploy models. It provides examples of tools used for each phase, including Spark, Hadoop streaming, SAS, and Python for exploratory analytics, and MLlib and Spark for operational analytics. The document also discusses lambda architectures for handling both batch and real-time analytics.
Background as a scientist, especially genomics/life sciences.
Shameless plug for our new book.
Or instead, what is data science?
SCARES ME the most when I show up at clients.
Difficult to define, but…
One way to organize these things.
TF-IDF model
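As a reminder of what the model computes, here is a minimal TF-IDF sketch in plain Python over a toy tokenized corpus (no Hadoop involved; the corpus and function name are illustrative, not from any particular library):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Term frequency scaled by inverse document frequency.
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["big", "data", "hadoop"], ["big", "models"], ["data", "data", "science"]]
w = tf_idf(docs)
```

Simple theory: "hadoop" appears in only one document, so it outscores "big", which appears in two. The complicated part is doing this over terabytes of documents, which is where the Hadoop implementation gets involved.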
From simple theory to complicated practical implementation.
Any given operation on an image is not difficult.
Reliably integrating satellite data with complex/custom pipelines is difficult. Must coordinate many tasks.
Most similar to research/science/statistics.
You don’t really know what you’re doing.
Exploratory. Lots of tools to do this – Python, R, SAS, etc. BI tools (Tableau).
Doing it at scale more difficult.
Hadoop centralizes. No need to copy data for each application.
Bioinformatics spends lots of time mucking with different file formats in different systems. Many orgs are very siloed.
Most unique to Hadoop/big data.
Don’t want to train a model once.
Given model, want to deploy it. Update it.
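The "deploy it, update it" pattern can be sketched with a toy online learner in plain Python – here a perceptron that updates one example at a time instead of being trained once (the class and data stream are illustrative, not a real serving system):

```python
class OnlinePerceptron:
    """A deployed model that keeps learning from a stream of examples."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict(self, x):
        s = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if s >= 0 else -1

    def update(self, x, y):
        # Only adjust the weights when the current model gets it wrong.
        if self.predict(x) != y:
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]
            self.b += y

model = OnlinePerceptron(2)
stream = [([2.0, 1.0], 1), ([-1.0, -2.0], -1),
          ([1.5, 0.5], 1), ([-2.0, -1.0], -1)]
for x, y in stream:
    model.update(x, y)   # model serves predictions and updates in place
```

In production the same idea shows up as incremental training (e.g. streaming updates in Spark) rather than re-fitting from scratch on every batch.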
If this is landscape of what data science is, what are some tools/recs?
~10 min mark
ETL tools. Traditional Hadoop.
Don’t want to say much except….
Most common thing: instrumentation and schemas. Need culture of data/telemetry.
Best stuff when you join data sets. Requires de-siloization. Requires centralized schemas.
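The join point can be sketched even without Hadoop – a simple hash join of two formerly siloed tables on a shared key (the table names, field names, and records are hypothetical):

```python
from collections import defaultdict

# Two "silos": web telemetry and CRM records, sharing a user_id key.
telemetry = [{"user_id": 1, "clicks": 12}, {"user_id": 2, "clicks": 3}]
crm = [{"user_id": 1, "plan": "pro"}, {"user_id": 2, "plan": "free"}]

def hash_join(left, right, key):
    # Build a hash index on one side, then probe it with the other.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index[l[key]]]

joined = hash_join(telemetry, crm, "user_id")
```

The join only works because both datasets agree on what `user_id` means – which is the centralized-schema requirement in miniature.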
Also Kafka
Ad Hoc
Focus on Accuracy, Visualization
Traditional stats tools like R, Python, SAS
Thunder as framework on Spark.
Mahout is deprecated.
Another way to think about the tools is based on different features…
Probably detect a theme here.
Lots of tools in Hadoop have a dichotomy between online and offline.
Do we have to choose?
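The lambda architecture's answer is: don't choose – run both and merge at query time. A toy sketch of the idea (plain-Python counters standing in for the batch, speed, and serving layers; all names are illustrative):

```python
from collections import Counter

# Batch layer: accurate but stale view, recomputed over all historical events.
batch_events = ["page_a", "page_b", "page_a"]
batch_view = Counter(batch_events)

# Speed layer: incremental view over events since the last batch recompute.
realtime_view = Counter()
for event in ["page_a", "page_c"]:
    realtime_view[event] += 1

# Serving layer: queries merge the two views.
def query(page):
    return batch_view[page] + realtime_view[page]
```

In a real deployment the batch view would come from MapReduce or Spark over HDFS and the speed layer from something like Spark Streaming fed by Kafka, but the merge-at-query-time shape is the same.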