Hadoop in a Windows Shop
Abuna Demoz – Abuna@AdGooroo.com
Brad Vah – Bvah@AdGooroo.com
Mike Schiro – Mschiro@AdGooroo.com
Twitter: @AdGooroo @abuna

Who Is AdGooroo?
• Founded in 2004
• We are the largest provider of Search Intelligence in the world
• Our customers include:
  – Agencies
  – CMOs
  – Marketing Managers
  – Digital Ad Sales
  – Over 4,000 users
• Global Scale
  – 50 Countries
  – 14 Search Engines
  – 14 Ad Networks

Learning Curve
• Where is Hadoop going to fit?
• How do we leverage existing tools?
• Linux can be less forgiving
  – rm -rf /*
• Who names these things?

Integration Points
• Active Directory != LDAP
• Create a seamless user experience
• Domjoin in 30 simple steps
  – Tip: It’s usually safe to blame Kerberos

Integration Points – Data Transfer
• SMB works…mostly
  – Flaky connectivity
  – Relatively slow transfer for GigE
• NFS
  – Client Services for NFS
  – Much faster transfer speeds

Integration Points – Data Transfer
• MountableHDFS/HDFS_Fuse
  – Fuse -> NFS -> Windows
    • We tried it. You should not.
  – SCP (Windows) -> NFS -> Fuse
    • Messy, but it works.
    • Don’t often need to use it

Monitoring and Management
• Operations Manager (MOM/SCOM)
  – Native Linux monitoring
  – Custom Management packs for Hadoop
• Opalis
  – Workflow automation
• Configuration Manager (SCCM)
  – Quest Management Xtensions for *nix

Final Thoughts
• Hadoop and Windows can live together.
• Microsoft is starting to figure out this whole “open-source” thing.
  – MSSQL connectors for Hadoop
  – ODBC driver for Hive
  – Interop initiatives
• When in doubt, blame Kerberos.
• Roll your own repo.

Environments
• Windows
  – Visual Studio, SQL Server, etc.
  – Physical workstations
• Linux
  – Getting reacquainted with an old friend
  – New suite of tools
  – Cloudera VM
    • RAMRAMRAMRAMRAMRAMRAMRAMRAM

Languages
• Java
  – Straightforward transition from the .NET world
  – Hmm…How do I create that JAR again?
• Python/Bash
  – Utilized a lot more than expected
• HiveQL
  – Simple transition from SQL
  – Custom UDFs

Unexpected Roadblocks – Avro
• Assumption:
  – Works with .NET
    • Can serialize files to be read by Java Map/Reduce
• Reality:
  – .NET compatibility not fully baked
    • Files written in .NET could not be read in Java.
  – The C# side neither reads nor writes the header
  – JIRA: AVRO-823

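The header problem above can be spotted from the consuming side before a job fails. As a minimal sketch: the Avro specification says an object container file begins with the four magic bytes `Obj` followed by the byte 1, so a file missing them will not open in a standard reader. The function name `has_avro_header` and the demo file names are hypothetical, and the check covers only the magic bytes, not the metadata or sync marker that follow.

```python
import os
import tempfile

# Per the Avro specification, an object container file starts with
# the ASCII bytes "Obj" followed by the byte 0x01.
AVRO_MAGIC = b"Obj\x01"


def has_avro_header(path):
    """Return True if the file at `path` begins with the Avro container magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(AVRO_MAGIC)) == AVRO_MAGIC


if __name__ == "__main__":
    # Hypothetical demo: one file with the magic bytes, one without.
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.avro")
        bad = os.path.join(tmp, "bad.avro")
        with open(good, "wb") as f:
            f.write(AVRO_MAGIC + b"rest of file")
        with open(bad, "wb") as f:
            f.write(b"raw records with no header")
        print(has_avro_header(good))  # True
        print(has_avro_header(bad))   # False
```

A check like this could run in the collection step, so headerless files are flagged before they reach Map/Reduce.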
Unexpected Roadblocks – Flume
• Assumption:
  – We’ll use Flume for Windows
• Reality:
  – Overkill for our needs
  – Implementation woes
• Solution:
  – Custom log collector service
  – Converts data to Avro file

Unexpected Roadblocks – Thrift
• Assumption:
  – We’ll use Thrift to talk to HBase from .NET
• Reality:
  – HBase.thrift does not support C# yet
• Solution:
  – Convert Thrift Java code-gen to .NET
    • Some community work already done here (https://bitbucket.org/vadim/hbase-sharp)

As Advertised – Sqoop
• Simple
• Fast route to POC
  – Imports
  – Exports
• Minor “gotchas”
  – Delimiters
  – Large exports to SQL Server
    • Use “--batch” mode

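Both gotchas above, delimiters and large exports, come down to flags on the `sqoop export` command line. The following is a minimal sketch of assembling such a command from Python; the helper `build_sqoop_export`, the server, database, table, and HDFS path are all hypothetical placeholders. `--input-fields-terminated-by` tells Sqoop how the HDFS files are delimited, and `--batch` asks the JDBC driver to execute the statements in batch mode.

```python
import shlex


def build_sqoop_export(jdbc_url, table, export_dir, delimiter="\t", batch=True):
    """Build the argument list for a `sqoop export` run (illustrative helper)."""
    cmd = [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--table", table,
        "--export-dir", export_dir,
        # State the delimiter explicitly rather than relying on defaults.
        "--input-fields-terminated-by", delimiter,
    ]
    if batch:
        # Batch mode for the underlying statement execution.
        cmd.append("--batch")
    return cmd


if __name__ == "__main__":
    # Hypothetical connection details; substitute real values.
    cmd = build_sqoop_export(
        "jdbc:sqlserver://sqlhost.example.com:1433;databaseName=Reporting",
        "KeywordStats",
        "/user/etl/keyword_stats",
    )
    print(shlex.join(cmd))
```

On a machine with Sqoop installed, the returned list could be passed to `subprocess.run(cmd, check=True)`. A real run would also need credentials, for example via `--username`.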
As Advertised – Hive
• Very similar to SQL
• “Quick” data analysis
  – Results without crippling your existing RDBMS
• HBase storage handler
  – Provides easy point of entry to data and data manipulation

Final Thoughts
• Don’t overthink it!
  – Just because you can doesn’t mean you should
• Modularity
  – Easy to be overwhelmed by all the moving parts
  – Flatten the learning curve by taking it one piece at a time