About me:
- 20 years working on data logistics for big data projects across a variety of clients
- 10+ years working with Python on data logistics
- Live in Manitou Springs
- Currently work for IBM as a data architect responsible for their security data warehouse
- Have presented on this topic often
I didn't say "Big Data project" - a big social networking site with 1 PB of content may not be doing as much analysis, and may not require as many feeds. Many would say this is the hardest part of data science. Many would say this can consume 90% of a data science budget.
As I'll get to in the next slide, you will probably have ***many*** feeds. This shows an ideal security data warehouse set of feeds: 24 feeds - but it could really be > 50.
Firewall only:
- stuck with looking for patterns
- might identify scans
- might identify recon
- will miss all distributed attacks
Firewall+:
- can tell if a scan came from a whitelist
- can see if activity involves known bad guys
- can see if activity involves high-value or vulnerable assets
Acknowledgements to Mike Koenig and Drum 8. "An Upsetting Theme" by Kevin MacLeod, licensed under Creative Commons "Attribution 3.0" (http://creativecommons.org/licenses/by/3.0/) and used here by permission, with appreciation and thanks. Herbert Morrison's on-the-scene recordings of the Disaster are public domain. Thanks to http://www.americanrhetoric.com for access.
Above example: the problem won't disappear for 11 months, and users will be reminded of it until it does. This is unlike a transactional system, in which evidence of problems is hidden. Quality problems are one of the top reasons for analytical system failure. Example:
- A country threatened to go to the UN if my company didn't retract an apology for its wrong analysis based on my data. Pretty intense.
Source systems won't tell you of changes they've made. Many businesses have many complex feeds to maintain.
http://creativecommons.org/licenses/by/2.0/deed.en
Example:
- A system I'm familiar with is spending 4x what we're spending on hardware & support, and loads at 1/8000th of our speed.
http://creativecommons.org/licenses/by-nc-sa/2.0/deed.en
http://www.flickr.com/photos/slworking/5328601506/
You could eventually paint yourself into a corner, in which the maintenance of your feeds is nearly impossible to keep up with. Examples:
- I know of some systems that take 6 months to build feeds. Others can do the exact same feed in 1 month.
- ETL tools aren't silver bullets
- XML isn't a silver bullet
- Your experience building transactional systems won't help you
This is not your world, it's your father's world. It's the world of mainframe batch systems from the 60s & 70s:
- Few streams
- Web services are too slow for the big feeds
- No fat object layers
- No record-by-record transactions
+ Batch processing
+ Bulk loading
+ Merging of files
Gorillas don't scale - King Kong couldn't exist because the square-cube law would require his bones to be disproportionately larger in cross-section at that size. Likewise, the work to build and maintain 50 feeds is more than 50x the work to do 1:
- Overhead services become more important, and take up more time
- Feeds have interdependencies
Plus, feeds don't age terribly well, as you discover that upstream systems make changes, say annually, without telling you.
You need consistency to keep maintenance costs low: too much inconsistency and you'll have an unmaintainable nightmare. But you need adaptability to work around source system requirements: too much consistency here and you'll be unable to add new data. Examples:
- you may have to use a client library in some other language
- you may have to use RSS, SSL, RMI, etc.
- you may have an extract on the other side of a firewall
These two worlds just don't talk much, especially since most ETL solutions have been closed source - it's a domain that's invisible to open source projects. Plus, ETL just isn't sexy. Now that big data projects are happening in corporate environments and open source ETL is getting coverage, it's getting more visibility.
From http://professional.robertbui.com/2009/10/kettle-cuts-80-off-data-extraction-transformation-and-loading/
Most solutions involve diagramming your feed, and the solution then either:
- generates code
- runs metadata through an engine
CASE tools were pretty much abandoned by the mid-90s - but not for ETL, since its main adherents were those who didn't program much anyway. So they've lingered, and so has the myth that ETL is too hard to write by hand. In the late 90s the Meta Group released a study that showed that COBOL programmers were more productive than the users of any ETL software.
My apologies to the Ruby guys who are all sick of this cartoon by now
Python for Data Logistics
Using Python for Data Logistics
Ken Farmer
Data Science and Business Analytics Meetup
http://www.meetup.com/Data-Science-Business-Analytics/events/120727322/
2013-06-25
About Data Logistics
My definition: management of data in motion.
Which includes: extract, transform, validation, change detection, loading, summarizing, aggregation (and some other stuff I don't care about*)
In context: a part of every big data analytical project.
Primary objective: make analysis efficient & effective.
* SOA, Enterprise Service Buses (ESB), Enterprise Application Integration (EAI), etc. But since these don't drive big data analytics, we're not going to talk about them.
Data Logistics Characteristics
- there will be many flows
Note:
● There may be many sources of any type of data
● There will be many different source constraints - operating systems, networks, etc.
● There will be upstream changes that will not be communicated - you will just see them in the data
Typical Large Security Data Warehouse
Data Logistics Characteristics
Side note - this is why there are many flows
1 Feed: a year of data mining will produce almost nothing
- or -
11 Feeds: lots of low-hanging fruit
So, which will produce the best analysis?
Data Logistics Characteristics
- and each flow can be complex
Parts not shown:
● File movement
● Logging, auditing & alerting
● Process monitoring
● Scheduling
Considerations not shown:
● Recovery
● Performance with high volumes
● Management
Data Logistics Characteristics
- and there's no simple alternative
The Great Idea vs. The Sad Reality:
● No delta processing: explodes data volumes; reduces functionality
● No lookups: explodes data volumes; reduces reporting query performance
● No dimensions: explodes data volumes; reduces reporting functionality; reduces reporting query performance
● No validation: increases maintenance costs; increases reporting errors
● No standardization: increases reporting costs; increases reporting errors; increases documentation costs
● No management features: decreases reliability; increases maintenance costs
Data Logistics Nightmares
So, what's the worst that can happen anyway?
Nightmare #1 - Data Quality
[Chart: ACME Widget Production by Month - widgets per month, Jan through Dec]
● Credibility
● Value
● Productivity
Data Logistics
- Most Common Nightmare Root Causes
How the heck did we get here?
Root Cause #1 - magical thinking
There are no fairies; likewise there are no silver bullets, and your CRUD experience won't help you.
Root Cause #2 - non-linear scalability
Gorillas don't scale gracefully; neither will your feeds.
The problem isn't performance - it's maintenance: dependencies, cascading errors, and institutional knowledge.
Root Cause #3 - too much consistency or adaptability
These two conflicting forces are at odds; you need a balance.
You have to have consistency to help with learning curves and organization.
You have to have adaptability to get access to all the data sources you'll want.
ETL to the Rescue
- data logistics from the corporate world!
● The corporate world started working on this 20 years ago
● It's still a hard problem, but it's less of a nightmare
● Starting to make inroads to Data Science / Big Data projects
ETL
- Batch: pipelines, not messages or transactions
Data is batched.
Feeds are organized like assembly lines or pipelines.
Each feed is broken into different programs / steps.
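The batch-pipeline idea above can be sketched in a few lines: each step reads a whole batch file and writes a new one for the next step. This is a minimal illustration, not code from the talk; the step name, CSV format, and file handoff convention are all assumptions.

```python
import csv

def transform_step(in_path, out_path, transform_row):
    """One pipeline step: read a whole batch file, transform each
    record, and write a new file for the next step to pick up."""
    with open(in_path, newline='') as fin, \
         open(out_path, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            writer.writerow(transform_row(row))
```

Because each step consumes and produces whole files, a failed step can simply be rerun from its input file - which is much of the appeal of batch pipelines over message-at-a-time designs.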
ETL
- Most tools use diagram-driven development
Which seems great to almost all management.
And seems pretty cool for a while to some developers.
ETL
- Most tools use diagram-driven development
But then someone always has to overdo it.
And we are reminded that tools are seldom solutions.
ETL
- So all is not wonderful
ETL - the last bastion of Computer-Aided Software Engineering (CASE) tools
Feature (ETL Tool / Custom Code):
● Unit test harnesses: no / yes
● TDD: no / yes
● Version control flexibility: no / yes
● Static code analysis: no / yes
● Deployment tool flexibility: no / yes
● Language flexibility: no / yes
● Continuous integration: no / yes
● Virtual environments: no / yes
● Diagrams: yes / yes
So, why don't we use metadata-driven or code-generation tools for everything? Why not use tools like FrontPage for all websites?
ETL
- So, Buy (& Customize) vs Build
The ETL Tool Paradox:
● Programmers don't want to work on it
● But it can only handle 80% of the problem without programming
Where the Buy option is a great fit:
● 100+ simple feeds
● Lack of programmer culture
● Standard already exists
Most typically - the "corporate data warehouse": a single database for an entire company (usually a bad idea anyway)
Python
- a perfect fit for data logistics
● You can use the same language for ETL, systems management and data analysis
● The language is high-level and maintenance-oriented
● It's easy for users to understand the code
● It allows you to use all the programming tools
● It's free
● It's a language for enthusiasts
● And it's fun - http://xkcd.com/353/
Python
- Build List
For each Feed Application:
● Program: extract
● Program: transform
● Config: file-image delta
● Config: loader
● Config: file mover
Services, Libraries and Utilities:
● Service: metadata, auditing & logging, dashboard
● Service: data movement
● Library: data validation
● Utility: file-image delta
● Utility: publisher
● Utility: loader
Python
- Typical Module List
Third-Party:
● appdirs
● database drivers
● sqlalchemy
● pyyaml
● validictory
● requests
● envoy
● pytest
● virtualenv
● virtualenvwrapper
Standard Library:
● os
● csv
● logging
● unittest
● collections
● argparse
● functools
Environmentals:
● Version control - git, svn, etc.
● Deployment - Fabric, Chef, etc.
● Static analysis - pylint
● Testing - pytest, tox, buildbot, etc.
● Documentation - sphinx
Bottom line: a mostly vanilla and very free environment will get you very far
Python ETL Components
- Scheduling
● Typically cron
● Daemon if you want more than one run per minute
● Should have suppression capability beyond commenting out the cron job
● Event-driven > temporally-driven
● Needs a check that no more than one instance is running
● Level of effort: very little
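The "only one instance running" check above is commonly done with a non-blocking flock on a lockfile. A minimal sketch, Unix-only; the function name and lockfile path convention are assumptions:

```python
import fcntl
import os

def acquire_single_instance_lock(lock_path):
    """Return a locked file handle if we are the only instance,
    or None if another process already holds the lock.
    The handle must stay open for the life of the process -
    the lock is released when it is closed or the process exits."""
    handle = open(lock_path, 'w')
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:          # lock already held elsewhere
        handle.close()
        return None
    handle.write(str(os.getpid()))   # aid debugging: record who holds it
    handle.flush()
    return handle
```

Because the kernel releases the flock when the process dies, a crashed job never leaves a stale lock behind - unlike a plain pidfile scheme.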
Python ETL Components
- Audit System
● Analyze performance & rule issues over time
● Centralize alerting
● Level of effort: weeks
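At its core an audit system just appends one record per feed run for later analysis and alerting. A minimal sketch using sqlite; the table schema and column names are illustrative assumptions:

```python
import sqlite3
from datetime import datetime, timezone

def write_audit(db_path, feed, step, rows_in, rows_rejected):
    """Append one audit record; dashboards and alerting can then
    query this table for trends and rule violations over time."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS audit
                    (ts TEXT, feed TEXT, step TEXT,
                     rows_in INTEGER, rows_rejected INTEGER)""")
    conn.execute("INSERT INTO audit VALUES (?, ?, ?, ?, ?)",
                 (datetime.now(timezone.utc).isoformat(),
                  feed, step, rows_in, rows_rejected))
    conn.commit()
    conn.close()
```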
Python ETL Components
- File Transporter
File movement is extremely failure-prone:
- out of space errors
- permission errors
- credential expiration errors
- network errors
So, use a process external to feed processing to move files - and simplify their recovery.
Note this is not the same as data mirroring:
- moves files from source to destination
- renames file during movement
- moves/deletes/renames source after move
- so, you may need to write this yourself - rsync is not ideal
Level of effort: pretty simple, 1-3 weeks to write a reusable utility
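The rename-during-movement behavior described above can be sketched with the standard library. The `.tmp` suffix convention and the archive-directory layout are assumptions for illustration:

```python
import os
import shutil

def move_file(src, dest_dir, archive_dir):
    """Copy src into dest_dir under a temporary name, rename it into
    place so consumers never see a partially-written file, then move
    the source to an archive dir so a rerun will not pick it up again."""
    name = os.path.basename(src)
    tmp_dest = os.path.join(dest_dir, name + '.tmp')
    final_dest = os.path.join(dest_dir, name)
    shutil.copy2(src, tmp_dest)        # slow part happens under the temp name
    os.rename(tmp_dest, final_dest)    # atomic within one filesystem
    shutil.move(src, os.path.join(archive_dir, name))   # archive the source
    return final_dest
```

A real transporter adds retries, credential handling, and remote transports, but the temp-name-then-rename pattern is the part that keeps downstream steps from reading half a file.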
Python ETL Components
- Load Utility
Functionality:
● Validates data
● Continuously loads
● Moves files as necessary
● May run delta operation
● Handles recoveries
● Writes to audit tables
Bottom line: pretty simple, 1-3 weeks to write a reusable utility
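The validate-then-load core of such a utility can be sketched as a loop with the validation, load, and audit behaviors passed in as callables. All names here are illustrative, not from the talk:

```python
import csv

def load_file(path, validate_row, load_row, audit):
    """Validate each record; load the good ones, count the bad ones,
    and write a single audit record for the whole file."""
    good = bad = 0
    with open(path, newline='') as f:
        for row in csv.reader(f):
            if validate_row(row):
                load_row(row)      # e.g. append to a bulk-load buffer
                good += 1
            else:
                bad += 1           # a real loader would also log rejects
    audit(path, good, bad)
    return good, bad
```

Keeping validation and loading as injected callables is what makes one loader reusable across many feeds: each feed supplies only its own rules.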
Python ETL Components
- Publish Utility
Functionality:
● Extracts all data since the last time it ran
● Can handle max rows
● Moves files as necessary
● Handles recoveries
● Writes to audit tables
● Writes all data to a compressed tarball
Bottom line: pretty simple, 1-3 weeks to write a reusable utility
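The "everything since the last run" bookkeeping can be as simple as a high-water mark in a state file. A sketch assuming rows are dicts with an increasing `id`; the state-file format and tarball layout are invented for illustration:

```python
import json
import os
import tarfile

def publish(rows, state_path, out_tarball):
    """Publish rows with id greater than the last published id as a
    compressed tarball, then advance the high-water mark. Returns the
    number of rows published; writes nothing when there is no new data."""
    last_id = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            last_id = json.load(f)['last_id']
    new_rows = [r for r in rows if r['id'] > last_id]
    if new_rows:
        data_path = out_tarball + '.data.json'
        with open(data_path, 'w') as f:
            json.dump(new_rows, f)
        with tarfile.open(out_tarball, 'w:gz') as tar:
            tar.add(data_path, arcname='data.json')
        os.remove(data_path)
        # advance the mark only after the tarball exists, so a crash
        # mid-publish leads to a harmless re-publish, not lost data
        with open(state_path, 'w') as f:
            json.dump({'last_id': max(r['id'] for r in new_rows)}, f)
    return len(new_rows)
```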
Python ETL Components
- Delta Utility
Functionality:
● Like diff - but for structured files
● Distinguishes between key fields vs non-key fields
● Can be configured to skip comparisons of certain fields
● Can perform minor transformations
● May be built into the Load utility, or a transformation library
Bottom line: pretty simple, 1-3 weeks to write a reusable utility
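The key-field vs non-key-field comparison above can be sketched over rows represented as dicts; the function name and field-handling conventions are assumptions:

```python
def file_delta(old_rows, new_rows, key_fields, ignore_fields=()):
    """Like diff, but structure-aware: return (inserts, deletes,
    changes) between two row sets keyed on key_fields, skipping any
    fields named in ignore_fields (e.g. a load timestamp)."""
    def key(row):
        return tuple(row[f] for f in key_fields)
    def comparable(row):
        return {f: v for f, v in row.items()
                if f not in key_fields and f not in ignore_fields}
    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}
    inserts = [r for k, r in new.items() if k not in old]
    deletes = [r for k, r in old.items() if k not in new]
    changes = [r for k, r in new.items()
               if k in old and comparable(r) != comparable(old[k])]
    return inserts, deletes, changes
```

For files too large to hold in memory, the same logic is usually run over two files sorted on the key fields, merging them a record at a time.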
Python Program
- Simple Transform

    def transform_gender(input_gender):
        """ Transforms a gender code to the standard format.
        :param input_gender: in either VARCOPS or SITHOUSE formats
        :returns: standard gender code
        """
        if input_gender.lower() in ['m', 'male', '1', 'transgender_to_male']:
            output_gender = 'male'
        elif input_gender.lower() in ['f', 'female', '2', 'transgender_to_female']:
            output_gender = 'female'
        elif input_gender.lower() in ['transsexual', 'intersex']:
            output_gender = 'transgender'
        else:
            output_gender = 'unknown'
        return output_gender

Observation: simple transforms & rules can be easily read by non-programmers.
Observation: transforms can be kept in a module and easily documented.
Observation: even simple transforms can have a lot of subtleties, and are likely to be referenced or changed by users.
Python Program
- Complex Transformation

    def explode_ip_range_list(ip_range_list):
        """ Transforms an ip range list to a list of individual ip addresses.
        :param ip_range_list: comma or space delimited ip ranges or ips.
            Ranges are separated with a dash, or use CIDR notation.
            Individual IP addresses can be represented with a dotted quad,
            integer (unsigned), hex or CIDR notation.
            ex: "10.10/16, 192.168.1.0 - 192.168.1.255, 192.168.2.3,
                 192.168.3.5 - 192.168.5.10, 192.168.5, 0.0.0.0/1"
        """
        output_ip_list = []
        for ip in whitelist.ip_expansion(ip_range_list):
            output_ip_list.append(ip)
        return output_ip_list

Ok, this is a cheat - the complexity is in the library.
Observation: complex transforms that would be a nightmare in a tool can be easy in Python - especially, as in this case, when there's a great module to use.
Observation: unit-testing frameworks are incredibly valuable for complex transforms.
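To make that last observation concrete, here is what pytest coverage of a transform looks like. The transform below is a simplified inline stand-in (a code-to-code lookup) so the example is self-contained; in a real feed you would import the transform from its module instead:

```python
# test_transforms.py - pytest collects and runs functions named test_*
def transform_gender(input_gender):
    # simplified stand-in for the real transform, defined inline
    # so this sketch runs on its own
    codes = {'m': 'male', 'male': 'male', '1': 'male',
             'f': 'female', 'female': 'female', '2': 'female'}
    return codes.get(input_gender.lower(), 'unknown')

def test_standard_codes():
    assert transform_gender('M') == 'male'
    assert transform_gender('2') == 'female'

def test_unknown_codes_map_to_unknown():
    # bad source data should degrade gracefully, not raise
    assert transform_gender('xyz') == 'unknown'
```

Run with `pytest test_transforms.py`; each subtle rule the users ask about becomes a test case that documents and protects it.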
The Bottom Line
The Good:
● Python for attracting & retaining developers
● Python for handling complexity
● Python for costs
● Python for adaptability
● Python for a modern development environment
The Not Good:
● Lack of good practices adds risk
● Lack of a rigid framework requires discipline
The Tangential:
● Hadoop - who said anything about Hadoop?
Thank You - Any Questions?