Sunday, 1 July 2018

Is it really Big!





Source: http://mattturck.com/big-data-landscape/

Saturday, 25 November 2017

The 8 worst predictive modeling techniques

This list is based on my opinion, and you are welcome to discuss it. Note that most of these techniques have evolved over the last 10 years to the point where most drawbacks have been eliminated, making the updated tool far different from, and better than, its original version. Yet the original, flawed versions remain widely used.
  1. Linear regression. Relies on normality, homoscedasticity and other assumptions; does not capture highly non-linear, chaotic patterns. Prone to over-fitting. Parameters are difficult to interpret. Very unstable when independent variables are highly correlated. Fixes: reduce the number of variables, apply a transformation to your variables, or use constrained regression (e.g. ridge or Lasso regression).
  2. Traditional decision trees. Very large decision trees are very unstable and impossible to interpret, and prone to over-fitting. Fix: combine multiple small decision trees together instead of using a large decision tree.
  3. Linear discriminant analysis. Used for supervised clustering (classification). A bad technique because it assumes that clusters do not overlap and are well separated by hyper-planes. In practice, they never are. Use density estimation techniques instead.
  4. K-means clustering. Used for clustering, tends to produce circular clusters. Does not work well with data points that are not a mixture of Gaussian distributions. 
  5. Neural networks. Difficult to interpret, unstable, subject to over-fitting.
  6. Maximum Likelihood estimation. Requires your data to fit with a prespecified probabilistic distribution. Not data-driven. In many cases the pre-specified Gaussian distribution is a terrible fit for your data.
  7. Density estimation in high dimensions. Subject to what is referred to as the curse of dimensionality. Fix: use (non parametric) kernel density estimators with adaptive bandwidths.
  8. Naive Bayes. Used e.g. in fraud and spam detection, and for scoring. Assumes that variables are independent; if they are not, it will fail miserably. In the context of fraud or spam detection, variables (sometimes called rules) are highly correlated. Fix: group variables into independent clusters of variables (in each cluster, variables are highly correlated). Apply naive Bayes to the clusters. Or use data reduction techniques. Bad text mining techniques (e.g. basic "word" rules in spam detection) combined with naive Bayes produce absolutely terrible results with many false positives and false negatives.
And remember to use sound cross-validation techniques when testing models!
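To make the linear regression item concrete, here is a minimal sketch, using only the standard library, of how a ridge penalty steadies the coefficients when two predictors are almost identical. The data, the sample size, the penalty value of 1.0 and the helper names (`fit_two_predictors`, `simulate`, `spread`) are all assumptions chosen for illustration, not part of the original article.

```python
import random

def fit_two_predictors(x1, x2, y, ridge_penalty=0.0):
    """Solve (X'X + penalty * I) beta = X'y for two predictors, no intercept."""
    a = sum(v * v for v in x1) + ridge_penalty
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + ridge_penalty
    r1 = sum(u * t for u, t in zip(x1, y))
    r2 = sum(u * t for u, t in zip(x2, y))
    det = a * d - b * b
    # Closed-form inverse of the 2x2 system.
    return ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)

def simulate(seed, ridge_penalty):
    """Draw one synthetic data set and return the fitted coefficients."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(200)]
    x2 = [v + rng.gauss(0, 0.01) for v in x1]  # almost a copy of x1
    y = [u + v + rng.gauss(0, 1) for u, v in zip(x1, x2)]
    return fit_two_predictors(x1, x2, y, ridge_penalty)

def spread(values):
    return max(values) - min(values)

# Refit on 20 independent samples and compare how much the first coefficient moves.
ols_b1 = [simulate(s, 0.0)[0] for s in range(20)]
ridge_b1 = [simulate(s, 1.0)[0] for s in range(20)]
print("plain least squares, range of beta1:", round(spread(ols_b1), 2))
print("ridge (penalty 1.0), range of beta1:", round(spread(ridge_b1), 2))
```

In this toy setup the plain least-squares coefficient swings widely from sample to sample, while the penalized one stays in a narrow band. The right penalty for real data would have to be chosen by cross-validation.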
Additional comments:
The reasons why such poor models are still widely used are:
  1. Many university curricula still use outdated textbooks, so many students are not exposed to better data science techniques.
  2. People use black-box statistical software without knowing its limitations and drawbacks, how to correctly fine-tune the parameters and optimize the various knobs, or what the software actually produces.
  3. Government forces regulated industries (pharmaceutical, banking, Basel) to use the same 30-year-old SAS procedures for statistical compliance. For instance, better scoring methods for credit scoring, even if available in SAS, are not allowed and are arbitrarily rejected by authorities. The same goes for clinical trial analyses submitted to the FDA, SAS being the mandatory software to be used for compliance, allowing the FDA to replicate analyses and results from pharmaceutical companies.
  4. Modern data sets are considerably more complex and different than the old data sets used when these techniques were initially developed. In short, these techniques have not been developed for modern data sets.
  5. There's no perfect statistical technique that would apply to all data sets, but there are many poor techniques.
In addition, poor cross-validation allows bad models to make the cut by over-estimating the true lift to be expected in future data, the true accuracy, or the true ROI outside the training set. Good cross-validation consists of
  • splitting your training set into multiple subsets (test and control subsets),
  • including different types of clients and more recent data in the control sets (than in your test sets),
  • checking the quality of forecasted values on the control sets,
  • computing confidence intervals for individual errors (error defined e.g. as |true value minus forecasted value|) to make sure that the error is small enough AND not too volatile (it has small variance across all control sets).
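The checklist above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the data are synthetic, the model is a one-variable least-squares line, and the folds are random rather than built from more recent data or different client types as recommended above. The function names `fit_line` and `cross_validate` are made up for this example.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def cross_validate(xs, ys, folds=5, seed=0):
    """Return the mean |true value - forecasted value| on each held-out fold."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    errors = []
    for k in range(folds):
        held_out = set(order[k::folds])  # the control subset for this round
        train = [i for i in order if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_errors = [abs(ys[i] - (slope * xs[i] + intercept)) for i in held_out]
        errors.append(statistics.fmean(fold_errors))
    return errors

# Synthetic data: a noisy straight line (an assumption for the demo).
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

per_fold = cross_validate(xs, ys)
print("mean error across folds:", round(statistics.fmean(per_fold), 3))
print("volatility (stdev across folds):", round(statistics.stdev(per_fold), 3))
```

A model passes this check only if both numbers are small: a low average error with a large spread across folds suggests the model will behave unpredictably on new data.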
Conclusion
I described the drawbacks of popular predictive modeling techniques that are used by many practitioners. While these techniques work in particular contexts, they've been applied carelessly to everything, like magic recipes, with disastrous consequences. More robust techniques are described here.

Tuesday, 30 August 2016

Setting up Python in Windows 10

This post walks through installing Python on the Windows 10 operating system.
Ready? Here’s your quick guide:
Set up Python on Windows 10
1. Visit the official Python download page and grab the Windows installer for the latest version of Python 3. A couple of notes:
  • Python is currently available in two versions — Python 2 and Python 3. For beginners, that can be confusing. In short, Python 3 is where the language is going; Python 2 has a large base of existing users but isn’t developing beyond bug fixes. Read this for more.
  • By default, the installer provides the 32-bit version. There’s also a 64-bit version available. I’ve generally stuck with 32-bit to avoid compatibility issues with some older packages, but installing is so easy you can experiment with either.
2. Run the installer. You’ll have two options — choose “Customize Installation.”
3. On the next screen, check all boxes under “Optional Features.” Click next.

4. Next, under “Advanced Options,” set the location where you want to install Python. For ease, I use:
C:\Python35-32
That refers to an installation of 32-bit Python 3.5.
5. Next, set the system’s PATH variable to include directories that include Python components and packages we’ll add later. To do this:
  • Open the Control Panel (easy way: right click the Start Menu icon and select Control Panel).
  • In the Control Panel, search for Environment; click Edit the System Environment Variables. Then click the Environment Variables button.
  • In the User Variables section, we will need to either edit an existing PATH variable or create one. If you are creating one, make PATH the variable name and add the following directories to the variable values section as shown, separated by a semicolon. If you’re editing an existing PATH, the values are presented on separate lines in the edit dialog. Click New and add one directory per line.
C:\Python35-32;C:\Python35-32\Lib\site-packages\;C:\Python35-32\Scripts\
6. Now, you can open a command prompt (Start Menu | Windows System | Command Prompt) and type:
python
That will load the Python interpreter:
Python 3.5.1 (v3.5.1:37a07cee5969, Dec  6 2015, 01:38:48) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
Because of the settings you included in your PATH variable, you can now run this interpreter — and, more important, a script — from any directory on your system.
Type exit() and hit Return to exit the interpreter and get back to a C: prompt.
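As a quick sanity check of the installation, a small script along these lines can report which interpreter is running and whether the directories from step 5 made it onto PATH. This is only a sketch: the helper names `describe_interpreter` and `missing_from_path` are made up here, and the directory list assumes the example install location used in this guide.

```python
import os
import struct
import sys

def describe_interpreter():
    """Collect basic facts about the running Python interpreter."""
    return {
        "version": sys.version.split()[0],
        "bits": struct.calcsize("P") * 8,  # pointer size tells 32-bit from 64-bit
        "executable": sys.executable,
    }

def missing_from_path(directories, path_value=None):
    """Return the directories that do not appear in the PATH string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    entries = {
        os.path.normcase(p.rstrip("\\/"))
        for p in path_value.split(os.pathsep)
        if p
    }
    return [d for d in directories if os.path.normcase(d.rstrip("\\/")) not in entries]

info = describe_interpreter()
print("Python", info["version"], f"({info['bits']}-bit) at", info["executable"])

# Example directories from this guide; adjust them to your own install location.
guide_dirs = [
    r"C:\Python35-32",
    r"C:\Python35-32\Lib\site-packages",
    r"C:\Python35-32\Scripts",
]
for directory in missing_from_path(guide_dirs):
    print("Not on PATH:", directory)
```

Save it under any name (for example check_install.py, a hypothetical file name) and run it with python from any directory. If nothing is reported as missing, the PATH setup worked.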
Optional: Set up useful Python packages
Python 3 comes with the package installer pip already in place, which makes it super easy to add useful packages to your Python installation. The syntax is this (replace some_package with a package name you want to install):
pip install some_package
1. Let’s add a couple of must-have utilities for web scraping: Requests and BeautifulSoup. You can use pip to install both with one command:
pip install beautifulsoup4 requests
2. csvkit, which I covered here, is a great tool for dealing with comma-delimited text files. Add it:
pip install csvkit
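To confirm that the packages landed in the right place, one option is to ask Python whether it can find them. The sketch below uses only the standard library; the helper name `installed` is made up for this example, and the module names assume the usual import names for these packages (beautifulsoup4 is imported as bs4).

```python
import importlib.util

def installed(module_names):
    """Map each module name to True/False depending on whether Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}

# Import names can differ from pip names: beautifulsoup4 is imported as bs4.
status = installed(["requests", "bs4", "csvkit"])
for name, ok in status.items():
    print(name, "OK" if ok else "missing, try pip install again")
```

This only checks that the modules can be located. It does not import them or verify their versions.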

You’re now set to get started using and learning Python under Windows 10. If you’re looking for a guide, start with the Official Python tutorial.

Source:
http://www.anthonydebarros.com/2015/08/16/setting-up-python-in-windows-10/

Have fun :) 

Easy way to create an HTML table

Thursday, 4 February 2016

Adding adb to your PATH (Mac)

Add to PATH for every login

in your terminal, navigate to home directory
cd
create file .bash_profile
touch .bash_profile
open file with TextEdit
open -e .bash_profile
insert line into TextEdit
export PATH=$PATH:/Users/username/Library/Android/sdk/platform-tools/
save file and reload file
source ~/.bash_profile
check if adb was set into path
adb version

One liner version
Echo your export command and append the output to the .bash_profile file, then restart the terminal. Single quotes keep $PATH from being expanded when echo runs, so the literal text lands in the file. (I have not verified this, but it should work.)
echo 'export PATH=$PATH:/Users/username/Library/Android/sdk/platform-tools/' >> ~/.bash_profile
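One caveat with appending via >> is that running the command twice writes the export line twice. A possible way around that, sketched here in Python, is to add the line only when it is not already present. The function name `add_export_line` is made up for this example, and the demo writes to a temporary file rather than a real .bash_profile.

```python
import tempfile
from pathlib import Path

def add_export_line(profile_path, directory):
    """Append an export line for `directory` unless the file already has it.

    Returns True if the file was changed, False otherwise.
    """
    profile = Path(profile_path).expanduser()
    line = f"export PATH=$PATH:{directory}"
    text = profile.read_text() if profile.exists() else ""
    if line in text.splitlines():
        return False
    # Start on a fresh line in case the file lacks a trailing newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with profile.open("a") as handle:
        handle.write(f"{prefix}{line}\n")
    return True

# Demo against a throwaway file so no real profile is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_profile = Path(tmp) / ".bash_profile"
    sdk_tools = "/Users/username/Library/Android/sdk/platform-tools/"
    print("first call changed file:", add_export_line(demo_profile, sdk_tools))
    print("second call changed file:", add_export_line(demo_profile, sdk_tools))
    print(demo_profile.read_text(), end="")
```

To use it for real, pass "~/.bash_profile" as the first argument and replace username with your own account name, then run source ~/.bash_profile as in the steps above.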