Different fields of science, economics, engineering, and marketing accumulate and store data primarily in electronic databases. Appropriate and well-established decisions should be made using the data collected. It is practically impossible to make sense of datasets containing more than a handful of data points without the help of computer programs. To be certain of the insights that the collected data provides and to make further decisions, data mining is performed where we go through distinctive analysis processes.
Exploratory data analysis is key, and usually the first exercise in data mining. It allows us to visualize data to understand it as well as to create hypotheses for further analysis. The exploratory analysis centers around creating a synopsis of data or insights for the next steps in a data mining project.
EDA in data science actually reveals ground truth about the content without making any underlying assumptions. This is the fact that data scientists use this process to actually understand what type of modeling and hypotheses can be created. Key components of exploratory data analysis include summarizing data, statistical analysis, and visualization of data. Python provides expert tools for exploratory analysis, with pandas for summarizing; scipy, along with others, for statistical analysis; and matplotlib and plotly for visualizations.
In this Free case study, Analytics Educator will show you how to use Machine Learning algorithm to predict the price of used cars (2nd hand cars) accurately. We will emphasize more on the EDA with Python part in this project and use only a single Machine learning algorithm to predict and determine the accuracy.
Craigslist is the world's largest collection of used vehicles for sale, this dataset which includes every used vehicle entry within the United States on Craigslist. We are going to predict the price of 2nd hand cars, so that the seller gets the maximum possible price.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
import os
#setting the path
os.chdir("C:\\Users\\ASUS\\Desktop")
data=pd.read_csv("vehicles.csv")
# create a back up copy
df = data.copy()
pd.set_option('display.max_columns', None)
df.head()
id | url | region | region_url | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | vin | drive | size | type | paint_color | image_url | description | county | state | lat | long | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7184791621 | https://duluth.craigslist.org/ctd/d/duluth-200... | duluth / superior | https://duluth.craigslist.org | 6995 | 2000.0 | gmc | new sierra 1500 | excellent | 8 cylinders | gas | 167783.0 | clean | automatic | 1GTEK19T3YE366021 | 4wd | NaN | NaN | red | https://images.craigslist.org/00n0n_f06ykBMcdh... | 2000 *** GMC New Sierra 1500 Ext Cab 157.5 WB... | NaN | mn | 46.8433 | -92.2550 |
1 | 7184773187 | https://duluth.craigslist.org/cto/d/saginaw-20... | duluth / superior | https://duluth.craigslist.org | 8750 | 2013.0 | hyundai | sonata | excellent | 4 cylinders | gas | 90821.0 | clean | automatic | NaN | fwd | NaN | NaN | grey | https://images.craigslist.org/00d0d_kgZ6xoeRw2... | For Sale: 2013 Hyundai Sonata GLS - $8750. O... | NaN | mn | 46.9074 | -92.4638 |
2 | 7193375964 | https://newhaven.craigslist.org/cto/d/stratfor... | new haven | https://newhaven.craigslist.org | 10900 | 2013.0 | toyota | prius | good | 4 cylinders | hybrid | 92800.0 | clean | automatic | NaN | fwd | NaN | NaN | blue | https://images.craigslist.org/00d0d_3sHGxPbY2O... | 2013 Prius V Model Two. One owner—must sell my... | NaN | ct | 41.1770 | -73.1336 |
3 | 7195108810 | https://albuquerque.craigslist.org/cto/d/albuq... | albuquerque | https://albuquerque.craigslist.org | 12500 | 2003.0 | mitsubishi | lancer | good | 4 cylinders | gas | NaN | clean | manual | NaN | 4wd | mid-size | sedan | grey | https://images.craigslist.org/00m0m_4a8Pb6JbMG... | 2003 Mitsubishi Lancer Evolution, silver. Abo... | NaN | nm | 35.1868 | -106.6650 |
4 | 7184712241 | https://duluth.craigslist.org/ctd/d/rush-city-... | duluth / superior | https://duluth.craigslist.org | 16995 | 2007.0 | gmc | sierra classic 2500hd | good | 8 cylinders | diesel | 254217.0 | clean | automatic | 1GTHK23D57F102022 | 4wd | full-size | truck | white | https://images.craigslist.org/01414_g093aPtSMW... | **Bad Credit, No Credit... No Problem!**2007 G... | NaN | mn | 45.6836 | -92.9648 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 539759 entries, 0 to 539758 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 539759 non-null int64 1 url 539759 non-null object 2 region 539759 non-null object 3 region_url 539759 non-null object 4 price 539759 non-null int64 5 year 538772 non-null float64 6 manufacturer 516175 non-null object 7 model 531746 non-null object 8 condition 303707 non-null object 9 cylinders 321264 non-null object 10 fuel 536366 non-null object 11 odometer 440783 non-null float64 12 title_status 536819 non-null object 13 transmission 535786 non-null object 14 vin 315349 non-null object 15 drive 383987 non-null object 16 size 168550 non-null object 17 type 392290 non-null object 18 paint_color 365520 non-null object 19 image_url 539740 non-null object 20 description 539738 non-null object 21 county 0 non-null float64 22 state 539759 non-null object 23 lat 530785 non-null float64 24 long 530785 non-null float64 dtypes: float64(5), int64(2), object(18) memory usage: 103.0+ MB
r=df.columns
for i in r:
print(df[i].value_counts())
7194886517 1 7189000001 1 7183550244 1 7196139303 1 7192970529 1 .. 7190568893 1 7191969294 1 7195637696 1 7190396865 1 7190089730 1 Name: id, Length: 423857, dtype: int64 https://orangecounty.craigslist.org/ctd/d/santa-ana-2011-mini-countryman-cooper/7194000283.html 1 https://baltimore.craigslist.org/ctd/d/lutherville-timonium-2011-chevrolet/7188930949.html 1 https://washingtondc.craigslist.org/mld/cto/d/silver-spring-toyota-corolla-2006/7193998974.html 1 https://honolulu.craigslist.org/oah/ctd/d/waipahu-auto-deals2011-chevrolet-camaro/7196201050.html 1 https://fortmyers.craigslist.org/lee/ctd/d/vero-beach-2011-ford-f250-super-duty/7188503273.html 1 .. https://hartford.craigslist.org/ctd/d/westport-2007-avalanche-lt-package-4wd/7196084213.html 1 https://baltimore.craigslist.org/cto/d/baltimore-2010-ford-escape-limited/7184749194.html 1 https://masoncity.craigslist.org/ctd/d/faribault-2014-ford-edge-se-awd-4dr/7190678802.html 1 https://westslope.craigslist.org/ctd/d/evans-2013-ford-150-f-wd-reg-cab-126/7195780250.html 1 https://medford.craigslist.org/ctd/d/portland-2014-chevrolet-suburban-lt-x4/7192045084.html 1 Name: url, Length: 423857, dtype: int64 columbus 3296 springfield 3294 jacksonville 3284 rochester 3087 salem 2985 ... meridian 22 kansas city 15 west virginia (old) 14 st louis 5 fort smith, AR 1 Name: region, Length: 404, dtype: int64 https://salem.craigslist.org 2985 https://kpr.craigslist.org 2983 https://fresno.craigslist.org 2983 https://stlouis.craigslist.org 2978 https://cosprings.craigslist.org 2977 ... https://kirksville.craigslist.org 37 https://nesd.craigslist.org 30 https://swks.craigslist.org 29 https://meridian.craigslist.org 22 https://wv.craigslist.org 14 Name: region_url, Length: 413, dtype: int64 0 35025 3500 3865 8995 3589 7995 3537 4500 3465 ... 28766 1 30815 1 20120 1 10581 1 40812 1 Name: price, Length: 16305, dtype: int64 2017.0 30828 2015.0 21901 2014.0 21773 2013.0 21480 2016.0 21430 ... 1917.0 1 1915.0 1 1913.0 1 1912.0 1 1911.0 1 Name: year, Length: 111, dtype: int64 ford 59786 chevrolet 46603 toyota 26001 honda 18898 nissan 17536 jeep 14224 gmc 12782 dodge 12547 ram 11917 hyundai 8765 subaru 7971 bmw 7717 volkswagen 7121 kia 7069 mercedes-benz 6859 chrysler 5903 buick 4668 cadillac 4615 mazda 3964 lexus 3916 audi 3334 acura 2784 lincoln 2593 pontiac 2323 infiniti 2318 volvo 2097 mitsubishi 1945 mercury 1330 mini 1320 rover 1151 saturn 1083 jaguar 787 fiat 567 tesla 299 alfa-romeo 138 harley-davidson 125 datsun 69 ferrari 64 aston-martin 28 land rover 15 porche 6 morgan 3 hennessey 1 Name: manufacturer, dtype: int64 f-150 6073 silverado 1500 3878 escape 3208 camry 2906 1500 2898 ... tundra sr5 truck 1 compass sport 4x4 low miles 1 f650 super duty regular ca 1 Vandolkswagon Jett57800 1 New Beetle 1 Name: model, Length: 27042, dtype: int64 excellent 85254 good 65262 like new 17880 fair 6724 new 1069 salvage 530 Name: condition, dtype: int64 6 cylinders 70130 4 cylinders 69542 8 cylinders 54091 5 cylinders 1594 10 cylinders 884 other 851 3 cylinders 406 12 cylinders 181 Name: cylinders, dtype: int64 gas 286775 diesel 24414 other 11311 hybrid 3737 electric 977 Name: fuel, dtype: int64 0.0 1799 150000.0 694 130000.0 671 140000.0 667 160000.0 645 ... 17925.0 1 194413.0 1 48603.0 1 25107.0 1 47171.0 1 Name: odometer, Length: 89554, dtype: int64 clean 315607 rebuilt 6017 salvage 3248 lien 1963 missing 711 parts only 213 Name: title_status, dtype: int64 automatic 292619 manual 23647 other 11799 Name: transmission, dtype: int64 1HTMRAAM19J040080 133 1FVDBSDC34HM88268 125 1FVDCXBS48HZ75358 118 77777777777777777 70 2NKHHJ7X2HM136714 60 ... 3FADP4TJ5DM149893 1 2T2BK1BA3FC297838 1 1FMCU9GD8JUC87396 1 1FAHP2E84GG134039 1 1FTNF1CF7EKD17926 1 Name: vin, Length: 103588, dtype: int64 4wd 99347 fwd 89632 rwd 42140 Name: drive, dtype: int64 full-size 52820 mid-size 30772 compact 16371 sub-compact 2664 Name: size, dtype: int64 sedan 64842 SUV 62106 truck 26967 pickup 24276 coupe 12931 other 12866 hatchback 9574 wagon 8332 van 7287 convertible 5854 mini-van 5213 offroad 530 bus 379 Name: type, dtype: int64 white 57646 black 43843 silver 32687 blue 22988 red 22395 grey 21752 custom 6299 green 5983 brown 4618 yellow 1775 orange 1502 purple 715 Name: paint_color, dtype: int64 https://images.craigslist.org/00H0H_lwWjXSEWNa7_0x20oM_600x450.jpg 2106 https://images.craigslist.org/00l0l_fl36gaeOcIT_0cU09G_600x450.jpg 133 https://images.craigslist.org/00a0a_2RYSGJJExad_0bM08P_600x450.jpg 125 https://images.craigslist.org/00202_l590CdXzJSo_001001_600x450.jpg 122 https://images.craigslist.org/01010_d7IBA4PgcmC_09e05H_600x450.jpg 115 ... https://images.craigslist.org/00k0k_adpaDg8st1T_0x20oM_600x450.jpg 1 https://images.craigslist.org/00C0C_4mpMrhwWWPp_0kE0fu_600x450.jpg 1 https://images.craigslist.org/00909_fdlnr1QHlhn_0ak07K_600x450.jpg 1 https://images.craigslist.org/01212_eB1nL5MODmY_09Q07v_600x450.jpg 1 https://images.craigslist.org/00w0w_78zjb9qds2U_0cU09G_600x450.jpg 1 Name: image_url, Length: 222413, dtype: int64 VISIT OUR WEBSITE WWW.DIRECTDETAILOFOH.COM HTTPS://DIRECTDETAILOHIO.COM/ WE HAVE A SELECTION OF CARS,TRUCKS,SUV'S. GUARANTEED APPROVAL. BUY HERE PAY HERE. NO CREDIT CHECKS! NO INCOME VERIFICATION! Vehicle is AS-IS NO Warranty. Information deemed reliable, but not guaranteed. Interested parties should confirm all data before relying on it to make a purchase decision. All prices and specifications are subject to change without notice. Prices may not include additional fees such as government fees and taxes, title and registration fees, finance charges, dealer document preparation fees, processing fees, and emission testing and compliance charges. 144 Frontier Auto Sales wants you to know we're about more than just selling cars our mission is to build long-lasting relationships. We want to grow our Frontier Family and see you send us your family and friends. We accept trades! (tow it, drive it or push it in.) We purchase vehicles on the spot. We even cosign vehicles if you need to sell yours let us do the work for you. Let us help find the perfect financing for you and your family’s needs. Spend 3 minutes to fill out a hassle-free secure application with Frontier Auto Sales and let us work hard to get you the better APR & Terms you deserve. Don’t forget Frontier Auto Sales has Alaska’s Best referral reward $300 for every individual you send that completes a purchase from Frontier Auto Sales. Disclaimer: All vehicles subject to prior sale. We reserve the right to make changes without notice and are not responsible for errors or omissions. All prices exclude dealer service fees and DMV fees, any finance charges. **Warranties may be available on all models for an additional cost** For more information on this and other vehicles please see our website: www.frontierautosales.org Frontier Auto Sales Frontier Auto Sales 1300 E. Dowling Rd. Anchorage, AK 99518 Call: 907-561-9900 Text: 907-717-7457 Like us on Facebook! Store Hours: Monday - Friday 9 AM - 8 PM Saturday 10 AM - 7 PM Sunday CLOSED 140 HEC Leasing Call Russ or Gary at 888-366-1571 Reduced Down Payments! Factory Warranty Included With Each Truck! Detroit DD13 450HP Engine Fuller 10-Speed Ultrashift Plus Automatic Trans. 342 RAR 240" WB Air Slide 5th Wheel w/In Cab Release Dual 100 Gallon Fuel Tanks 22.5 LP Tires 6 Pol, Aluminum Wheels 72" Double Bunk Sleeper w/ Premium Interior Freightliner Parksmart Batt HVAC System Auxiliary Heater in Sleepers Dual High Back Air Ride Seats AUTOMATIC TRANSMISSIONS!!! We make financing decisions IN HOUSE! Low weekly rates! We make equipment acquisition easy & affordable. We offer flexible financing, and it is our mission to provide you with a transportation solution that meets your hauling needs and fits within your budget. Our lease to own arrangements are designed to making acquiring a truck as painless as possible. We are a small, family-oriented company that provides the personalized service you need to ensure you are acquiring the right truck under the right terms. Leasing is a great way to get your small business off the ground or quickly add capacity to your current operations. To learn more about our truck leasing and financing services, contact us today. Call one of our VP of Sales - Gary Carrigan, Russ Carrigan at 888-366-1571! Peterbilt, Kenworth, Freightliner, Western Star, International, Crew Cab, Mack,Freightliner, Kenworth, Peterbilt, Owner, Operator, Over the Road, Trucking, Drivers, Semi, Truck, International, Day Cab, Cascadia, Courrier, Over The Road, Leasing Semi, Volvo, Commercial, Used Trucks, 99 Pennington Ford, and our sister store, right across the street, Pennington Chevrolet Buick, have been working hard to provide you with a Premier selection of used cars and trucks. We are dedicated to you, our customer. Front to back, we will be here for everything. From helping you search to find the car or truck of your dreams, to delivery. Making your Car shopping experience easier is what we do! Be sure to check out the rest of our inventory, and call with any questions. 618-783-2385. As always here at The Pennington Auto Group, we have financing available and all trades are welcomed. Stop by Newton Illinois, Premier Car Dealership or give us a call, we look forward to doing business with you!!! Pennington Ford & Polaris 806 S Van Buren Newton, IL 62448 618-783-2385 OVER 200+ NEW/USED VEHICLES IN STOCK!! CHECK OUT OUR INVENTORY WWW.PENNINGTONFORD.COM WWW.PENNINGTONCHEVROLETBUICK.COM 79 www.TEXASDIESELSTORE.com CALL OR TEXT 254-247-2172 2016 GMC SIERRA 2500HD SLT - 4X4 6.6L DURAMAX TURBO DIESEL ALLISON AUTOMATIC TRANSMISSION 4 WHEEL DRIVE - z71 158K MILES LEATHER - LOADED! HEATED AND COOLED SEATS TOW HITCH FACTORY NAVIGATION 18" FUEL WHEELS 295/70R18 COOPER TIRES FIXED CENTER CONSOLE SLT PACKAGE W/ LEATHER, BACK UP CAMERA, NAVIGATION & MORE 100% RUST FREE TRADE INS? YES, WE TAKE THEM! CONTACT US WITH YOUR VEHICLE INFO FOR A FREE APPRAISAL! FREE SHIPPING IN THE STATE OF TEXAS!!! WE ALSO OFFER PICK UP FROM THE AUSTIN AIRPORT! FLY IN & DRIVE HOME WITH UP TO A $500 REIMBURSEMENT IN AIRFARE! WE SPECIALIZE IN HARD TO FIND DIESELS! CHECK OUT OUR WEBSITE! OUR INVENTORY IS CONSTANTLY CHANGING WITH NEW ARRIVALS DAILY! CALL OR TEXT 254-247-2172 You are looking at an All Terrain HD edition 2016 GMC SIERRA 2500HD crew cab SLT pick up with the popular 6.6L Duramax turbo diesel engine and Allison automatic transmission! This truck is BEAUTIFUL and super clean! The truck is all stock! Never been hitched! Non smoker truck! The interior looks GREAT as well with leather seats and fixed console with touch screen navigation on the dash! This truck is hard loaded with all important factory options including heated seats, navigation also is smoke free! Does have a few hail dings on the hood. Overall very clean truck! Dual climate control and heated seats for everybody to be comfortable! 100% free of rust! Recently serviced and ready for a new owner! Call or text us today for walk-around videos or pictures! Texas Diesel Store is a NATIONWIDE dealer! We service anyone, anywhere, anytime! We offer shipping options, we can also pick you up from the airport and bring you directly to the dealership! Call or Txt us for a shipping quote!! We also offer extremely competitive Financing options! There is a free credit application on our website. We work hard to have the best financing options available with the most competitive rates. Click the link to fill out a Credit Application: http://texasdieselstore.com/financing Contact Us Today @ 254-247-2172 Shipping and reimbursements are subject to the completion of the transaction at our designated list price. Airfare reimbursements will be refunded at the completion of the deal as long as an airfare receipt is provided for our records. 67 ... 2006 Chrysler Sebring GTC Convertible Offered by: Affordable Auto Sales Albuquerque — (505) 431-4233 — $5,995 Power Steering,Power Brakes,Power Door Locks,Power Windows,Power Drivers Seat,Radial Tires,Gauge Cluster,Trip Odometer,Tachometer,Air Conditioning,Tilt Steering Wheel,Cruise Control,Tinted Glass,Driver Side Air Bag,Front Bucket Seats,Reclining Seats,Compact Disc Player,Anti-Lock Braking System,Alloy Wheels,Rear Spoiler,Clock,Interval Wipers,Rear Defroster,Remote Trunk Lid,Console,Carpeting,Vanity Mirror,Day/Night Lever,Dual Sport Mirrors,Driver Side Remote Mirror,Velour Upholstery,Accent Stripes,Body Side Moldings,Center Arm Rest,Map Lights,Inside Hood Release,Fold Down Rear Seat,Maintenance Free Battery,Halogen Headlights,Airbags - Front - Dual,Air Conditioning - Front,Emergency Interior Trunk Release,Reading Lights Front,Child Seat Anchors,Cruise Control,Multi-function Remote Trunk Release,Multi-function Remote Keyless Entry,Front Power Outlet,Center Console Front Console With Storage,Driver Seat Power Adjustments,Exterior Entry Lights,Warnings And Reminders Low Fuel Level,Headlights Auto Delay Off,Power Door Locks,Exterior Mirrors Power,Front Suspension Type: Double Wishbones,Front Suspension Classification: Independent,Convertible Roof Soft Top,Seats Vinyl Upholstery,Seatbelts Seatbelt Pretensioners,Rear Seats Bench,Tachometer,Convertible Rear Window Glass,Convertible Roof Power,Rear Spoiler,Seats Front Seat Type: Bucket,Front Headrests Adjustable,Windows Front Wipers: Intermittent,Power Windows,Suspension Stabilizer Bar(s): Front,Suspension Stabilizer Bar(s): Rear,Windows Rear Defogger,AM/FM STEREO RADIO,PASSENGER SIDE AIR BAG Affordable Auto Sales Albuquerque Year: 2006 Make: Chrysler Model: Sebring Series: GTC Convertible VIN: 1C3EL75R16N153709 Stock #: 153709 Condition: Used Mileage: 157,901 MPG: 21 city / 28 highway Exterior: Green Interior: Dark Slate Gray Body: Convertible Transmission: Automatic Engine: 6 Cylinder ⚡⚡️⚡⚡️⚡️⚡️⚡️⚡️⚡️⚡⚡️⚡️ CALL/TEXT (505) 431-4233⚡️⚡️⚡️⚡️⚡️⚡️⚡️⚡️⚡️⚡️⚡⚡️ 🚗 🚕 🚙 🚌 🚎 🏎 - AFFORDABLE AUTO SALES - QUALITY RELIABLE CARS - 🚒 🚐 🚚 🚛 🚜 🏍 ✅ DRIVE AWAY TODAY AT A GREAT CASH PRICE!!!!! ☮️ EASY FINANCING FOR ALL CREDIT SITUATIONS - APPROVED! ☮️ WE EVEN OFFER BUY HERE PAY HERE - WITH APPROVED DOWN PAYMENT! 💥💥💥 FOR FAST PRE-APPROVALS GO TO: 💥💥💥 https://ssl-wwwaffordableautosalesnmcom.dcs-cms.com/creditapp.aspx ☮️ TRADE-INS WELCOME, WE BUY USED CARS! ☮️ VEHICLE HISTORY REPORTS!! CARFAX! ☎️ SE HABLA ESPAÑOL! MAL CREDITO, SIN CREDITO -- NO PROBLEMAS! LLAME O TEXTO (702) 503-2373 ↪️ FOR MORE ADS BY THIS DEALER GO TO: https://albuquerque.craigslist.org/search/ctd?query=affordable+auto+sales More vehicle details: affordable-auto-sales-albuquerque.hammerwebsites.net/v/qTRRfScX Address: 4525 Lomas Blvd NE Albuquerque, NM 87110 Phone: (505) 431-4233 Website: www.affordableautosalesnm.com 📲 ☎️ Call or text (505) 431-4233 for quick answers to your questions about this Chrysler Sebring Your message will always be answered by a real human — never an automated system. Disclaimer: Affordable Auto Sales Albuquerque will never sell, share, or spam your mobile number. Standard text messaging rates may apply. All vehicles are subject to prior sale. We reserve the right to make changes without notice and are not responsible for error or omissions. All prices exclude government fees and taxes, any finance charges and any emissions test charge. It is responsibility of the buyer to call and verify all information in this ad.\r \r ALL PRICES POSTED ARE CASH PRICES ,,, FINANCING IS AVAILABLE,,, ADDITIONAL FEES MAY APPLY\r \r \r In house financing is only available on select vehicles. See dealer for details 2006 Chrysler Sebring GTC Convertible 6828bfd01ab24e88a067a9c2d09416c6 1 Very solid 2001 Ford expedition Eddie bower ed. Great tranny and powerhouse. Sun roof has issues. Duck tape works at pleez. Daily driver but need cash badly. Walk or downgrade . Gotta have 650. No bickering. 650 or keep. No title but not stolen. Have reg. Tag., insur. In name now. Title issues get troublesome. With title it's worth 2500 easy cash. show contact info 1 2009 Buick LaCrosse CXL 133k miles Leather Heated seats Dual power seats Still inspected through 2021 and will have zero issues passing again No engine light or issues Runs and drives smooth and quiet AC is ice cold Heat blows hot for the unfortunate upcoming winter Tires are good Brakes are good No pulsating in the pedal or noises No suspension clunks or noises This car runs and drives 100% Bring $3700 and drive it away 1 2014 Subaru XV Crosstrek Premium Huebner Chevrolet Subaru330-915-31751155 Canton RD NWCarrollton, OH 44615 Price: $15,000Mileage: 73,515Stock #: 64628A Type: UsedTrim: PremiumBodystyle: HatchbackTransmission: Continuously VariableDrive: AWDEngine: 2.0L 4-cylFuel Type: Regular GasolineExterior Color: Crystal Black SilicaInterior Color: IvorySeats: 5 --- LOW MILES!! --- AWD --- NAVIGATION!! --- BACKUP CAMERA --- BLUETOOTH --- CVT - 33 MPG!!! --- 17 ALLOY WHEELS --- FULL POWER ACCESSORIES --- STEERING WHEEL CONTROLS --- HEATED SEATS --- REAR SPOILER --- FOG LIGHTS --- ROOF RAILS --- UNIVERSAL HOME REMOTE ---We've just taken in an ADVENTUROUS, XV Crosstrek 2.0i Premium with just 73,542 miles. The beautiful, Crystal Black Silica exterior and the Ivory cloth interior are both in very nice shape.This 2.0i XV Crosstrek Premium comes with the 2.0L engine, paired with the Lineartronic CVT transmission, and AWD. When equipped with the CVT, the transmission shifts seamlessly through fixed ratios for optimum engine control, getting you improved fuel economy rates of 33 MPG! Plus, Subaru is a leader in All-Wheel-Drive technology so the Crosstrek boasts one of the best systems in it's class!You get some really great features on the XV Crosstrek, like heated seats, premium 17 charcoal and silver alloy wheels, fog lights, roof rails, heated side mirrors, power accessories, cruise control and Bluetooth connectivity for your phone and audio streaming. For sound, you gain a 6-Speaker Audio System that features a CD player, USB port and an iPod/MP3 audio input.If you want fun and versatility all rolled into one small package, then this sharp looking, adventurous 2014 XV Crosstrek Premium is just what you've been looking for!! Here at HUEBNER CHEVROLET SUBARU we make sure to put every used vehicle through an extensive 100 Point Safety Inspection so you can buy and drive with peace of mind. Come in for a no pressure test drive today! http://www.autofusion.com/AF2/vdp/47098108.html 1 2007 Nissan Maxima Very Strong Engine Still Running and Driving, Heated Seats, Touchscreen Radio, Bluetooth, Sunroof, Backup Camera, Blind Spot Camera, Leather Seats, Power Seats, Reclining Seats in the Back, Ice Cold Ac, Push to Start, Remote Key, Keyless Entry, Mini Fridge/cooler, Power Windows, Runs Fine No Issues Ac Blows Cold, Leather Seats Moonroof Tinted Windows Bluetooth & Aux Cord, Excellent Condition Second Owner, Original Low Miles, New Tires, All Power, 6 Cd Player, Sirius Radio, a/c and Heater Works Excellent and It's a Very Clean Truck Inside and Outside. 1 Name: description, Length: 275765, dtype: int64 Series([], Name: county, dtype: int64) ca 45798 fl 27413 tx 23411 ny 18186 or 17312 oh 16920 mi 15995 nc 15179 wa 14260 wi 12744 pa 12376 co 11996 tn 10722 va 10580 nj 10159 il 10143 id 9482 ia 9005 mn 8965 az 8353 mt 7580 ma 7548 ok 6481 sc 6470 in 6028 ga 6012 ks 5875 al 5232 ct 5082 mo 4915 ky 4662 md 4468 ar 4320 nm 4151 ak 3973 nv 3081 hi 2969 nh 2958 me 2941 dc 2936 la 2935 vt 2588 ri 2530 ne 2087 sd 1756 ms 1399 ut 1039 wy 844 de 839 wv 693 nd 466 Name: state, dtype: int64 33.7792 4259 43.1824 2769 46.2348 2026 40.2043 1665 40.4688 1558 ... 41.8691 1 29.6278 1 32.6157 1 58.3605 1 44.0055 1 Name: lat, Length: 41175, dtype: int64 -84.4118 4259 -84.1122 2767 -119.1280 2071 -74.1779 1665 -74.2817 1558 ... -77.4883 1 -94.7044 1 -84.3864 1 -72.5057 1 -88.2199 1 Name: long, Length: 41492, dtype: int64
df= df.drop(columns=['id','url', 'region_url', 'vin', 'image_url', 'description', 'lat', 'long','county','region'], axis=1)
df.head()
price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | drive | size | type | paint_color | state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6995 | 2000.0 | gmc | new sierra 1500 | excellent | 8 cylinders | gas | 167783.0 | clean | automatic | 4wd | NaN | NaN | red | mn |
1 | 8750 | 2013.0 | hyundai | sonata | excellent | 4 cylinders | gas | 90821.0 | clean | automatic | fwd | NaN | NaN | grey | mn |
2 | 10900 | 2013.0 | toyota | prius | good | 4 cylinders | hybrid | 92800.0 | clean | automatic | fwd | NaN | NaN | blue | ct |
3 | 12500 | 2003.0 | mitsubishi | lancer | good | 4 cylinders | gas | NaN | clean | manual | 4wd | mid-size | sedan | grey | nm |
4 | 16995 | 2007.0 | gmc | sierra classic 2500hd | good | 8 cylinders | diesel | 254217.0 | clean | automatic | 4wd | full-size | truck | white | mn |
df.describe()
price | year | odometer | |
---|---|---|---|
count | 4.238570e+05 | 328743.000000 | 2.705850e+05 |
mean | 5.779437e+04 | 2009.827358 | 9.942578e+04 |
std | 1.002581e+07 | 10.135809 | 1.204833e+05 |
min | 0.000000e+00 | 1900.000000 | 0.000000e+00 |
25% | 4.499000e+03 | 2007.000000 | 4.432400e+04 |
50% | 9.995000e+03 | 2012.000000 | 9.174900e+04 |
75% | 1.899500e+04 | 2016.000000 | 1.372370e+05 |
max | 3.808256e+09 | 2021.000000 | 1.000000e+07 |
Lets see if any outliers in Target variable as we remove them the model can be more accurate because they cause a bring a lot of difference in values of mean and SD.
the difference between 75% value and max value is too large so lets leave 10% values at both ends of a distribution
rr=sorted(df["price"])
quantile1, quantile3= np.percentile(rr,[10,90])
print(quantile1,quantile3)
389.6000000000058 31500.0
df=df[(df.price < 27500) & (df.price >= 500 )]
df.shape
(322693, 15)
r=sorted(df["odometer"])
r
[137.0, 45431.0, 48932.0, 49858.0, 64515.0, 76202.0, 77200.0, 79160.0, 81900.0, 86942.0, 88000.0, 89156.0, 89875.0, 90000.0, 90112.0, 90821.0, 91840.0, 92800.0, 94800.0, 102000.0, 102462.0, 102825.0, 103000.0, 115000.0, 116711.0, 124995.0, 125000.0, 126207.0, 128000.0, 135000.0, 138588.0, 142835.0, 143528.0, 148334.0, 150959.0, 167783.0, nan, 108124.0, 145955.0, 152415.0, 159000.0, 160600.0, 170259.0, 170684.0, 176386.0, 178054.0, 178108.0, 181986.0, 184817.0, 188406.0, 196913.0, 200000.0, 208016.0, 210865.0, 223470.0, 254217.0, 271000.0, 309621.0, nan, nan, nan, 299000.0, nan, nan, 8000.0, 36000.0, 53300.0, 62000.0, 65000.0, 68559.0, 73000.0, 82000.0, 91806.0, 94448.0, 100460.0, 110000.0, 113000.0, 130000.0, 143700.0, 156559.0, 167828.0, 186200.0, 229937.0, nan, 85000.0, 176000.0, 204496.0, 230500.0, 242613.0, 262000.0, nan, nan, 208453.0, nan, nan, nan, nan, nan, nan, nan, nan, 196000.0, 260607.0, nan, 500.0, 15414.0, 38000.0, 50000.0, 56289.0, 58764.0, 63000.0, 64596.0, 76164.0, 85122.0, 85556.0, 87000.0, 87742.0, 92000.0, 104897.0, 104898.0, 105000.0, 105880.0, 108869.0, 115000.0, 116751.0, 119592.0, 123000.0, 125147.0, 148212.0, 152000.0, 153000.0, 158000.0, 160025.0, 165000.0, 192420.0, nan, 179804.0, nan, 216800.0, nan, 165000.0, 193000.0, nan, nan, nan, 35224.0, 40629.0, 50273.0, 82740.0, 103000.0, 105000.0, 113400.0, 124000.0, 140000.0, 141118.0, 152000.0, 195413.0, nan, nan, nan, 111563.0, 144000.0, 147893.0, 155000.0, 158000.0, 192000.0, 251000.0, nan, 144263.0, 172000.0, 187000.0, 195756.0, 210000.0, nan, nan, nan, nan, 168000.0, 181000.0, 223000.0, 250000.0, nan, 238000.0, 319000.0, nan, 1699.0, 27000.0, 32428.0, 53646.0, 66782.0, 74408.0, 78507.0, 86643.0, 93559.0, 104033.0, 109527.0, 119598.0, 131164.0, 142055.0, 170623.0, nan, 142763.0, nan, nan, 22289.0, 28269.0, 50776.0, 73766.0, 74273.0, 79091.0, 80242.0, 90822.0, 102783.0, 110346.0, 130348.0, 134177.0, 176580.0, nan, 6988.0, 13479.0, 32276.0, 40234.0, 40533.0, 43400.0, 48900.0, 51636.0, 55268.0, 58112.0, 71416.0, 77000.0, 79198.0, 80566.0, 84411.0, 93060.0, 96000.0, 97540.0, 99898.0, 103369.0, 110000.0, 112103.0, 112447.0, 116022.0, 120000.0, 120940.0, 127411.0, 145996.0, 146660.0, 150778.0, 152000.0, 161201.0, 178318.0, 179000.0, 185423.0, 187867.0, 196611.0, 198117.0, 198334.0, nan, nan, nan, nan, nan, nan, 19734.0, 21449.0, 35677.0, 37885.0, 43961.0, 66850.0, 72530.0, 75771.0, 77945.0, 79639.0, 100761.0, 105000.0, 110400.0, 132000.0, 133000.0, 134041.0, 138000.0, 149273.0, 151000.0, 154025.0, 159147.0, 160000.0, 178000.0, 179856.0, nan, nan, nan, 161776.0, 164030.0, nan, nan, nan, nan, 220000.0, nan, nan, nan, nan, 13848.0, 21800.0, 46145.0, 47055.0, 54062.0, 65748.0, 90261.0, 91001.0, 93988.0, 95000.0, 97426.0, 100694.0, 102000.0, 103240.0, 103600.0, 105311.0, 109210.0, 110380.0, 113000.0, 113000.0, 122269.0, 126000.0, 134000.0, nan, nan, 118587.0, 141690.0, 145155.0, 190000.0, 200000.0, 200245.0, 215000.0, nan, nan, nan, 121365.0, 168000.0, 174599.0, 179942.0, 290000.0, nan, nan, nan, nan, 299.0, 17767.0, 23595.0, 24335.0, 26704.0, 30628.0, 35322.0, 35500.0, 35650.0, 45024.0, 67500.0, 68000.0, 68670.0, 69618.0, 75346.0, 80275.0, 84550.0, 85000.0, 94877.0, 95500.0, 99000.0, 102270.0, 102830.0, 106000.0, 110000.0, 111090.0, 112000.0, 114000.0, 116435.0, 121352.0, 128929.0, 131000.0, 136000.0, 140000.0, 143000.0, 155000.0, 156000.0, 157000.0, 158000.0, 159000.0, 161626.0, 253000.0, nan, nan, nan, nan, 140000.0, 183000.0, nan, nan, nan, nan, 131000.0, 134523.0, 170000.0, 174000.0, nan, 22432.0, 26000.0, 27000.0, 28777.0, 35623.0, 36442.0, 38475.0, 40574.0, 43708.0, 44026.0, 45000.0, 49615.0, 69000.0, 78000.0, 82000.0, 84000.0, 94025.0, 95000.0, 105131.0, 106182.0, 106182.0, 107500.0, 113000.0, 120000.0, 125000.0, 138000.0, nan, nan, nan, nan, nan, nan, 139900.0, 147000.0, 165000.0, nan, nan, 38579.0, 96857.0, nan, 8805.0, 21000.0, 38475.0, 46000.0, 69761.0, 78869.0, 80945.0, 80945.0, 92993.0, 96000.0, 96857.0, 101758.0, 115000.0, 120000.0, 123212.0, 127146.0, 128518.0, 128647.0, 130000.0, 133000.0, 133183.0, 134000.0, 135362.0, 141000.0, 144000.0, 149000.0, 156201.0, 156940.0, 164715.0, 170000.0, 182000.0, 182000.0, 192399.0, 197000.0, nan, nan, nan, 15776.0, 16000.0, 22501.0, 60952.0, 73079.0, 79307.0, 82000.0, 97000.0, 100755.0, 105000.0, 114000.0, 116998.0, 123000.0, 125770.0, 127000.0, 128000.0, 128000.0, 129000.0, 130469.0, 133000.0, 135000.0, 135323.0, 138095.0, 143376.0, 145000.0, 149260.0, 160016.0, 162000.0, 170000.0, 172170.0, 177000.0, 189000.0, 189723.0, 190610.0, 203000.0, nan, nan, 197000.0, 1655000.0, nan, nan, 200500.0, 246000.0, nan, 164000.0, 172000.0, nan, nan, nan, 162000.0, 234000.0, nan, nan, nan, 228000.0, nan, nan, nan, nan, 99800.0, 129000.0, 157000.0, 160000.0, nan, 15822.0, 18680.0, 31322.0, 38109.0, 38287.0, 42303.0, 42703.0, 42819.0, 52247.0, 56634.0, 62922.0, 64262.0, 67132.0, 68559.0, 76301.0, 76867.0, 78000.0, 83000.0, 86865.0, 87875.0, 91806.0, 92483.0, 95195.0, 98000.0, 109000.0, 115048.0, 118369.0, 123000.0, 124000.0, 130055.0, 142618.0, 149553.0, 150000.0, 158500.0, 164000.0, 169700.0, 186000.0, nan, 17831.0, 24000.0, 37403.0, 44000.0, 51900.0, 54040.0, 56289.0, 56544.0, 57900.0, 64669.0, 74875.0, 75100.0, 78099.0, 79900.0, 96000.0, 102903.0, 105567.0, 122000.0, 124000.0, 131500.0, 135000.0, 145500.0, 147318.0, 148500.0, 153741.0, 167882.0, 170000.0, 172817.0, 174000.0, 186709.0, 188250.0, 192223.0, 198895.0, 198900.0, 206000.0, 220000.0, 237600.0, 238000.0, 277000.0, nan, nan, nan, 76373.0, 139836.0, nan, nan, nan, 12783.0, 22909.0, 24586.0, 38437.0, 55802.0, 58098.0, 58613.0, 58893.0, 64809.0, 72095.0, 92367.0, 101270.0, 128000.0, 167160.0, nan, 64982.0, 76074.0, 84085.0, 97846.0, 109926.0, 124000.0, 138154.0, 140123.0, 140290.0, 142786.0, 159219.0, 168000.0, 169072.0, 176617.0, 194000.0, 200000.0, nan, nan, nan, 4556.0, 8473.0, 10824.0, 25277.0, 25739.0, 31908.0, 38018.0, 50000.0, 50670.0, 58828.0, 74662.0, 87000.0, 91601.0, 94000.0, 99762.0, 101827.0, 101856.0, 103000.0, 120000.0, 144000.0, 153342.0, 195230.0, nan, 175325.0, nan, 108260.0, 110000.0, 130000.0, 180000.0, 191206.0, 206000.0, 1000000.0, nan, nan, nan, nan, 4356.0, 10213.0, 13147.0, 18416.0, 25277.0, 26237.0, 29711.0, 37990.0, 48789.0, 49539.0, 60137.0, 70200.0, 70206.0, 73409.0, 73988.0, 75680.0, 84483.0, 86568.0, 89329.0, 96200.0, 100455.0, 105593.0, 110365.0, 114299.0, 117000.0, 120298.0, 121005.0, 138365.0, 154000.0, 155000.0, 156008.0, 159000.0, 163000.0, 177000.0, 188000.0, 192000.0, 197000.0, 224000.0, 248000.0, nan, nan, 237000.0, nan, 270330.0, nan, 32752.0, 41000.0, 69110.0, 71455.0, 72869.0, 74057.0, nan, 49880.0, 77139.0, 83100.0, 88652.0, 90095.0, 94280.0, 128322.0, 129094.0, 161000.0, nan, 87828.0, 160000.0, 167384.0, nan, nan, nan, nan, nan, 10700.0, 34600.0, 131000.0, 144000.0, 182000.0, nan, 221000.0, nan, nan, nan, nan, nan, nan, nan, 0.0, 14190.0, 15000.0, 21007.0, 31286.0, 36824.0, 36990.0, 37932.0, 51728.0, 54269.0, 57308.0, 59097.0, 64507.0, 67336.0, 69245.0, 71032.0, 73012.0, 75036.0, 78470.0, 80559.0, 92481.0, 99220.0, 115000.0, 125000.0, 136000.0, 139000.0, 142718.0, 145000.0, 147000.0, 178174.0, 184285.0, 500000.0, nan, nan, 177558.0, 178045.0, 180000.0, 233840.0, nan, nan, nan, 15822.0, 18120.0, 18309.0, 20270.0, 25577.0, 26139.0, 27057.0, 27637.0, 32639.0, 32752.0, 33924.0, 34933.0, 38287.0, 39524.0, 40142.0, 42703.0, 58098.0, 66199.0, 82000.0, 131456.0, nan, nan, nan, 44000.0, 82854.0, 87791.0, 119000.0, 120432.0, 125103.0, nan, nan, 130066.0, 133280.0, 143000.0, 145331.0, 178000.0, 217000.0, 228000.0, nan, 42739.0, 56593.0, 110004.0, nan, nan, nan, 28598.0, 29147.0, 29266.0, 32336.0, 41546.0, 49683.0, 50067.0, 52926.0, 52985.0, 58428.0, 63599.0, 70579.0, 72664.0, 98000.0, 99129.0, 100648.0, 102917.0, 109000.0, 119000.0, 128000.0, 134000.0, 148865.0, 152000.0, 158000.0, 165000.0, 179296.0, nan, 150795.0, 182000.0, 183015.0, 192420.0, nan, 172800.0, nan, nan, 23315.0, 26354.0, 29609.0, 32841.0, 34341.0, 34449.0, 44000.0, 49189.0, 53615.0, 54297.0, 55000.0, 56136.0, 60421.0, 69000.0, 89000.0, 94673.0, 97604.0, 102000.0, 109391.0, 109422.0, 116000.0, 126000.0, 133465.0, 135723.0, 140000.0, 141731.0, 142951.0, 145750.0, 158335.0, 159000.0, 182000.0, 199412.0, 205000.0, nan, 249750.0, nan, 163000.0, 177364.0, nan, 1.0, 7700.0, 11675.0, 19775.0, 22332.0, 28418.0, 32061.0, 40000.0, 46400.0, 59902.0, 64000.0, 65000.0, 66098.0, 69000.0, 82814.0, 83586.0, 101635.0, 116154.0, 134117.0, 139415.0, 141773.0, 143157.0, 143267.0, 144370.0, 152946.0, 153796.0, 156741.0, 162919.0, 179200.0, 179854.0, 185326.0, 189600.0, 222000.0, 224139.0, 234000.0, nan, 247181.0, nan, nan, 124.0, 126.0, 131.0, 138.0, 146.0, 4472.0, 8025.0, 8697.0, 18137.0, 29490.0, 38000.0, 60678.0, 61007.0, 64541.0, 70000.0, 74000.0, 77022.0, 90406.0, 124479.0, ...]
df["odometer"].isna().sum()
117593
ax = sns.scatterplot(x="odometer", y="price", data=df)
df["odometer"].max()
10000000.0
df.drop(df[df["odometer"]==64809218.0].index,inplace=True)
df.drop(df[df["odometer"]==0.0].index,inplace=True)
ax = sns.scatterplot(x="odometer", y="price", data=df)
ax.get_xaxis().get_major_formatter().set_scientific(False)
ax.get_yaxis().get_major_formatter().set_scientific(False)
df["odometer"].isna().sum()
0
df=df[(df.odometer < 3000000)]
ax = sns.scatterplot(x="odometer", y="price", data=df)
ax.get_xaxis().get_major_formatter().set_scientific(False)
Now lets see year column
df["odometer"].isna().sum()
0
df["year"].isna().sum()
471
the null values in a year column cannot be replaced so lets eliminate them
df["year"].min()
1919.0
we must alo drop this 0 values
df.drop(df[df["year"]==0.0].index,inplace=True)
df=df.dropna(subset=['year'])
bx = sns.scatterplot(x="year", y="price", data=df)
df=df[(df.year > 1940)]
we have removed outliers from all three numerical columns . now lets how we can remove with nan values if any.
I am taking out % of null values in each column
null_values=df.isnull().sum()
null_values=pd.DataFrame(null_values,columns=['null'])
j=1
sum_tot=len(df)
null_values['percent']=null_values['null']/sum_tot
round(null_values*100,3).sort_values('percent',ascending=False)
null | percent | |
---|---|---|
size | 12960400 | 63.617 |
condition | 7777500 | 38.177 |
cylinders | 6546200 | 32.133 |
paint_color | 4872100 | 23.915 |
drive | 4708300 | 23.111 |
type | 3891800 | 19.103 |
manufacturer | 623500 | 3.061 |
model | 157000 | 0.771 |
fuel | 144800 | 0.711 |
title_status | 112100 | 0.550 |
transmission | 82700 | 0.406 |
price | 0 | 0.000 |
year | 0 | 0.000 |
odometer | 0 | 0.000 |
state | 0 | 0.000 |
df.condition.value_counts()
excellent 60679 good 48590 like new 11258 fair 4796 new 321 salvage 305 Name: condition, dtype: int64
excellent_odo_mean = df[df['condition'] == 'excellent']['odometer'].mean()
good_odo_mean = df[df['condition'] == 'good']['odometer'].mean()
like_new_odo_mean = df[df['condition'] == 'like new']['odometer'].mean()
salvage_odo_mean = df[df['condition'] == 'salvage']['odometer'].mean()
fair_odo_mean = df[df['condition'] == 'fair']['odometer'].mean()
print('Like new average odometer:', round( like_new_odo_mean,2))
print('Excellent average odometer:', round( excellent_odo_mean,2))
print('Good average odometer:', round( good_odo_mean,2))
print('Fair average odometer:', round( fair_odo_mean,2))
print('Salvage average odometer:', round( salvage_odo_mean,2))
Like new average odometer: 85098.04 Excellent average odometer: 112103.1 Good average odometer: 121570.61 Fair average odometer: 182986.37 Salvage average odometer: 168911.82
df.loc[df.year>=2019, 'condition'] = df.loc[df.year>=2019, 'condition'].fillna('new')
df.loc[df['odometer'] <= like_new_odo_mean, 'condition'] = df.loc[df['odometer'] <= like_new_odo_mean, 'condition'].fillna('like new')
df.loc[df['odometer'] >= fair_odo_mean, 'condition'] = df.loc[df['odometer'] >= fair_odo_mean, 'condition'].fillna('fair')
df.loc[((df['odometer'] > good_odo_mean) &
(df['odometer'] <= excellent_odo_mean)), 'condition'] = df.loc[((df['odometer'] > good_odo_mean) &
(df['odometer'] <= excellent_odo_mean)), 'condition'].fillna('excellent')
df.loc[((df['odometer'] > like_new_odo_mean) &
(df['odometer'] <= good_odo_mean)), 'condition'] = df.loc[((df['odometer'] > like_new_odo_mean) &
(df['odometer'] <= good_odo_mean)), 'condition'].fillna('good')
df.loc[((df['odometer'] > good_odo_mean) &
(df['odometer'] <= fair_odo_mean)), 'condition'] = df.loc[((df['odometer'] > good_odo_mean) &
(df['odometer'] <= fair_odo_mean)), 'condition'].fillna('salvage')
null_values=df.isnull().sum()
null_values=pd.DataFrame(null_values,columns=['null'])
j=1
sum_tot=len(df)
null_values['percent']=null_values['null']/sum_tot
round(null_values*100,3).sort_values('percent',ascending=False)
null | percent | |
---|---|---|
size | 12960400 | 63.617 |
cylinders | 6546200 | 32.133 |
paint_color | 4872100 | 23.915 |
drive | 4708300 | 23.111 |
type | 3891800 | 19.103 |
manufacturer | 623500 | 3.061 |
model | 157000 | 0.771 |
fuel | 144800 | 0.711 |
title_status | 112100 | 0.550 |
transmission | 82700 | 0.406 |
price | 0 | 0.000 |
year | 0 | 0.000 |
condition | 0 | 0.000 |
odometer | 0 | 0.000 |
state | 0 | 0.000 |
df=df.dropna(subset=['title_status','fuel','transmission','model','manufacturer'])
df=df.drop(["size"],axis=1)
null_values=df.isnull().sum()
null_values=pd.DataFrame(null_values,columns=['null'])
j=1
sum_tot=len(df)
null_values['percent']=null_values['null']/sum_tot
round(null_values*100,3).sort_values('percent',ascending=False)
null | percent | |
---|---|---|
cylinders | 6176000 | 31.960 |
paint_color | 4617300 | 23.894 |
drive | 4431900 | 22.934 |
type | 3672300 | 19.004 |
price | 0 | 0.000 |
year | 0 | 0.000 |
manufacturer | 0 | 0.000 |
model | 0 | 0.000 |
condition | 0 | 0.000 |
fuel | 0 | 0.000 |
odometer | 0 | 0.000 |
title_status | 0 | 0.000 |
transmission | 0 | 0.000 |
state | 0 | 0.000 |
df['paint_color'] = df['paint_color'].fillna(method='ffill')
df['drive'] = df['drive'].fillna(method='ffill')
df['type'] = df['type'].fillna(method='ffill')
df['cylinders'] = df['cylinders'].fillna(method='ffill')
df.isnull().sum()
price 0 year 0 manufacturer 0 model 0 condition 0 cylinders 0 fuel 0 odometer 0 title_status 0 transmission 0 drive 0 type 3 paint_color 0 state 0 dtype: int64
df=df.dropna(subset=['cylinders','drive'])
df.isnull().sum()
price 0 year 0 manufacturer 0 model 0 condition 0 cylinders 0 fuel 0 odometer 0 title_status 0 transmission 0 drive 0 type 3 paint_color 0 state 0 dtype: int64
from scipy import stats
sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0xc734d400>
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 193243 entries, 0 to 423856 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 price 193243 non-null int64 1 year 193243 non-null float64 2 manufacturer 193243 non-null object 3 model 193243 non-null object 4 condition 193243 non-null object 5 cylinders 193243 non-null object 6 fuel 193243 non-null object 7 odometer 193243 non-null float64 8 title_status 193243 non-null object 9 transmission 193243 non-null object 10 drive 193243 non-null object 11 type 193240 non-null object 12 paint_color 193243 non-null object 13 state 193243 non-null object dtypes: float64(2), int64(1), object(11) memory usage: 22.1+ MB
c=df.columns
for i in c:
print(df[i].value_counts())
7995 2595 8995 2523 6995 2386 9995 2372 5995 2241 ... 26699 1 16462 1 10435 1 6341 1 18441 1 Name: price, Length: 8905, dtype: int64 2017.0 15378 2013.0 14735 2014.0 13842 2012.0 13140 2011.0 12678 ... 1958.0 8 1954.0 7 1947.0 3 1952.0 3 1942.0 1 Name: year, Length: 78, dtype: int64 ford 33184 chevrolet 26703 toyota 16707 honda 13674 nissan 12246 jeep 8519 dodge 7617 gmc 6492 hyundai 6423 subaru 6093 volkswagen 5306 bmw 5183 kia 5113 ram 4481 mercedes-benz 4094 chrysler 4082 buick 3251 mazda 3008 cadillac 2760 lexus 2630 audi 2071 acura 1995 lincoln 1734 infiniti 1487 pontiac 1474 volvo 1374 mitsubishi 1340 mini 1030 mercury 938 saturn 779 rover 470 jaguar 415 fiat 410 harley-davidson 68 alfa-romeo 42 datsun 33 tesla 7 land rover 7 porche 2 morgan 1 Name: manufacturer, dtype: int64 f-150 2775 escape 2394 camry 2043 silverado 1500 1947 altima 1825 ... sportage lx awd suv 1 benz e 300 1 s350 1 romeo giulia turbo sport 1 sedona lx lx w/uvo, 1 Name: model, Length: 14357, dtype: int64 good 62848 excellent 58189 like new 37501 salvage 19682 fair 10574 new 4449 Name: condition, dtype: int64 4 cylinders 77762 6 cylinders 69569 8 cylinders 42229 5 cylinders 1934 10 cylinders 792 other 476 3 cylinders 366 12 cylinders 115 Name: cylinders, dtype: int64 gas 176996 diesel 7707 other 5240 hybrid 2827 electric 473 Name: fuel, dtype: int64 150000.0 647 140000.0 617 160000.0 601 130000.0 597 170000.0 563 ... 232100.0 1 166563.0 1 83281.0 1 3764.0 1 113853.0 1 Name: odometer, Length: 71104, dtype: int64 clean 185464 rebuilt 4332 salvage 2131 lien 1130 missing 138 parts only 48 Name: title_status, dtype: int64 automatic 175068 manual 12131 other 6044 Name: transmission, dtype: int64 fwd 89763 4wd 71858 rwd 31622 Name: drive, dtype: int64 sedan 60245 SUV 53032 truck 14423 pickup 13544 coupe 9359 other 9192 hatchback 9128 wagon 7980 van 6410 mini-van 5156 convertible 4293 offroad 331 bus 147 Name: type, dtype: int64 white 45381 black 37263 silver 31001 blue 21752 grey 20148 red 19419 custom 5686 green 5277 brown 4388 yellow 1285 orange 1055 purple 588 Name: paint_color, dtype: int64 ca 16791 fl 12208 mi 9402 oh 9401 ny 8663 tx 8296 wi 7566 nc 6618 pa 6521 or 5775 va 5765 nj 5720 tn 5492 co 5438 ia 5064 wa 4382 id 4239 ma 4048 il 4039 mn 4020 mt 3971 in 3457 ct 3073 sc 2858 ok 2735 al 2688 ky 2667 ks 2646 md 2496 ak 2408 az 2401 mo 2292 ga 2206 ar 1729 nm 1556 nh 1502 nv 1481 ri 1440 vt 1365 me 1340 dc 1293 la 1195 hi 894 sd 875 ne 722 ms 697 ut 515 de 424 wy 326 nd 275 wv 268 Name: state, dtype: int64
ax = sns.barplot(x="condition", y="price", data=df)
Clearly vehicles with condition new has highest price as one expects.
sns.catplot(y="cylinders", x="price",kind = "violin", data=df)
<seaborn.axisgrid.FacetGrid at 0xc734de48>
sns.catplot(x="fuel", y="price", kind="boxen",
data=df)
<seaborn.axisgrid.FacetGrid at 0xc7ad9dd8>
sns.catplot(x="title_status", y="price",kind="violin", data=df)
<seaborn.axisgrid.FacetGrid at 0xc930efd0>
sns.catplot(x="transmission", y="price",kind="bar", palette="ch:.25", data=df)
<seaborn.axisgrid.FacetGrid at 0xd0bdc2e8>
The relation between price and transmission.
sns.violinplot(x=df.drive, y=df.price);
There doesnt seem too be much difference between the first 2 types of drives .
The third one is a bit different.
sns.catplot(y="type", x="price",kind="boxen", data=df);
sns.catplot(y="paint_color", x="price",kind="violin", data=df);
sns.catplot(y="manufacturer", x="price",kind="box", data=df);
from sklearn import preprocessing
import pandas as pd
le = preprocessing.LabelEncoder()
df.columns
Index(['price', 'year', 'manufacturer', 'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'drive', 'type', 'paint_color', 'state'], dtype='object')
df[['manufacturer', 'model', 'condition',
'cylinders', 'fuel', 'title_status', 'transmission',
'drive', 'type', 'paint_color', 'state']]=df[['manufacturer', 'model', 'condition',
'cylinders', 'fuel', 'title_status', 'transmission',
'drive', 'type', 'paint_color', 'state']].apply(le.fit_transform)
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
y= df.price
X= df.drop('price',axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
(154594, 13) (154594,) (38649, 13) (38649,)
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
from sklearn import metrics
from sklearn.metrics import mean_squared_error as MSE
print('Mean Absolute Error:', round(metrics.mean_absolute_error(y_test, y_pred),2))
print('Mean Squared Error:', round(metrics.mean_squared_error(y_test, y_pred),2))
print('Root Mean Squared Error:', round(np.sqrt(metrics.mean_squared_error(y_test, y_pred)),2))
Mean Absolute Error: 1726.32 Mean Squared Error: 7435213.35 Root Mean Squared Error: 2726.76
d = pd.DataFrame()
d["test_y"] = y_test
d["y_pred"] = y_pred
#Mape with formula
d["mp"] = abs((d["test_y"]- d["y_pred"])/d["test_y"])
(d.mp.mean())*100#mape
29.507433494326758